Developer CD Series 1998 May: Tool Chest

home *** CD-ROM | disk | FTP | other *** search

/ Developer CD Series 1998 May: Tool Chest / Dev.CD May 98 TC.toast / Tool Chest / Development Kits / • Other Platforms / PCCTS 1.31 / Documentation / manual.txt < prev next >

Wrap

Text File | 1995-03-10 | 160.2 KB | 5,767 lines | [TEXT/MPS ]

PCCTS Reference Manual Version 1.00 T. J. Parr, H. G. Dietz, and W. E. Cohen School of Electrical Engineering Purdue University West Lafayette, IN 47907 August 1991 pccts@ecn.purdue.edu [Note: please see release notes for all versions since 1.00] [Note: The Postscript version is much more readable] This document describes the structure and use of the Purdue Compiler-Construction Tool Set (PCCTS), a set of pub- lic domain software tools designed to facilitate the imple- mentation of compilers and other translation systems. Distribution of PCCTS is managed by an email server that reads the Subject: line of each email sent to pccts@ecn.purdue.edu. A missing or undeci- pherable Subject: line will cause the system to reply with email explaining how to use the server. Full C source code for PCCTS, various examples, and on-line documentation are available. PCCTS was developed within the School of Electrical Engineering at Purdue University, supported in part by NSF award number 8624385A3-CDR. As a public domain software release, Purdue University and the authors provide the sys- tem strictly on an "as-is" basis, without warranty or lia- bility. Despite this, bug reports and fixes are welcome. 1. Introduction This document describes structure and use of the Purdue Compiler-Construction Tool Set (PCCTS). In many ways, PCCTS is simi- lar to a highly integrated version of YACC [Joh78] and LEX [Les75]; where ANTLR (ANother Tool for Language Recognition) corresponds to YACC and DLG (DFA-based Lexical analyzer Generator) functions like LEX. However, PCCTS has many additional features which make it easier to use for a wide range of translation problems. PCCTS grammars contain specifications for lexical and syntactic analysis, intermediate-form construction and error reporting. Future releases will include an integrated code-generator generator to sup- port code-generation and other intermediate-form translations. Rules may employ Extended BNF (EBNF) grammar constructs and may define parameters, return values and local variables. Languages described in PCCTS are recognized via Strong LL(k) parsers constructed in pure, human-readable, C code. The Version 1.00 Release PCCTS package is considered public- domain and includes the following items: o C source code for antlr - parser generator. o C source code for dlg - DFA-based lexical analyzer generator. o C source for a symbol table manager with arbitrary scoping capa- bilities. o Grammars for ISO PASCAL and ANSI C with symbol table management. o C source for rexpr(char *expr, char *s) which answers whether s is in the language (regular expression) described by expr. o On-line documents - PCCTS Version 1.00 Reference Manual PCCTS was developed within the School of Electrical Engineering at Purdue University to aid in constructing various prototype and spe- cialized compilers. After using the system for over a year, we decided that it was mature enough to warrant release to a few test sites. Hence, in Spring 1990, an experimental version of the system, known as PCCTS B, was made available via anonymous ftp. Although the system was only announced via two short postings in network news- groups, by January 1991, there were over 450 known user sites world- wide - and we were hopelessly behind in responding to requests for hardcopy manuals. Now, only a year older but much wiser, we are finally ready for a full release of PCCTS - Version 1.00. Experience has proven that we are not able to deal with high-quantity hardcopy requests; instead, the basic documentation (this document) is being printed by SIGPLAN Page 1 PCCTS 1.00 Notices and all documentation is available on-line. The rest of this section describes the overall characteristics of PCCTS including its programming interface, translation to C code and input file format. Also, an example is presented to illustrate the development of parsers using PCCTS. 1.1. PCCTS programming interface PCCTS accepts language descriptions in the form of a set of gram- matical rules and token/lexeme definitions. Significant emphasis was placed on making the PCCTS programming interface simple and flexible. Much of the notation was derived from previous language tools and/or programming languages in order to bring a sense of familiarity and to attenuate the learning curve. For example, attribute notation ($i) and grammar syntax were derived mainly from YACC, though they have been augmented for use in the EBNF environment. Rules may define return types, parameter lists and local variables just as in a pro- gramming language. A mechanism for describing abstract-syntax-trees is also provided. PCCTS Grammars tend to be smaller and more readable than grammars written for other parser-generators because of the EBNF notation and because precedence rules are defined inherently within the grammar rather than delineated via pseudo-instructions (e.g. %left, %right in YACC). In addition, PCCTS translators (parser plus actions) are gen- erally easier to develop because of the extensive programming support. For example, init-actions allow the user to define local stack- based variables or execute a piece of C initialization code. Often, a distinct copy of a variable is required for each invocation of a rule. This requires the overhead of a software stack when using other language tools. PCCTS constructs a function for each rule which allows each rule invocation to automatically create its own copy of the user variables on the hardware stack. PCCTS parsers are constructed in pure, human-readable, C code. As a result, standard debugging tools can be used to trace and debug PCCTS parsers. Breakpoints can be set so that parser execution stops before or after certain grammar fragments of interest have been recog- nized. PCCTS parsers consist of if, and while statements which can be easily traced by a human observer. PCCTS supports intermediate-form (such as expression-trees) con- struction via a flexible abstract-syntax-tree (AST) mechanism which allows trees to be built explicitly or automatically. The user expli- citly creates trees via a LISP-like tree constructor or directs the automatic tree construction facility via simple grammar directives. AST tree nodes are user-defined and are generally a function of attri- butes or terminals. A default transformation from attributes/terminals ($-variables) to AST nodes can be specified. Alternatively, each tree node can be defined explicitly via an AST node constructor. Page 2 PCCTS 1.00 1.2. PCCTS information flow PCCTS currently consists of ANTLR (ANother Tool for Language Recognition) and DLG (DFA-based lexical analyzer generator) which con- struct parsers and lexical analyzers respectively. ANTLR accepts as input a complete language description including rules for both lexical and syntactic analysis and produces a C program that recognizes sentences in that language. A C file is constructed for each grammar description file. Also, three additional files - err.c, tokens.h and parser.dlg (all of which can be renamed) are gen- erated as support files. err.c contains error information and token class definitions. tokens.h is a list of #define's representing all of the token names defined or referenced in the grammar description files as well as function prototypes for rules that have return types. parser.dlg is a specification of the lexical analyzer generator derived from the user's language description and is translated to C by DLG. DLG creates an automaton that breaks up the input character stream into tokens used by the ANTLR-generated parser and it creates mode.h which is a list of #define's corresponding to the separate automatons defined by your grammar. The following is a graphical representation of the information flow during translation from parser description to executable parser. [Figure omitted] _______________________________________________________________________ _______________________________________________________________________| |f1 ... fn ||Input ANTLR grammar description files | |______________||______________________________________________________| |______________||______________________________________________________| |tokens.h ||List of token names found in f1 ... fn | |______________||______________________________________________________| |______________||______________________________________________________| |parser.dlg ||Scanner description prepared for DLG | |______________||______________________________________________________| |______________||______________________________________________________| |mode.h ||Definitions for lexical modes (#lexclass's) | |______________||______________________________________________________| |______________||______________________________________________________| 1.3. PCCTS file(s) format PCCTS descriptions adhere to a fairly flexible format and can be Page 3 PCCTS 1.00 spread out over many files. The actual grammar for PCCTS input can be defined in itself as: grammar : { "#header" Action } ( Action | tokenDef | eclassDef )* ( rule | tokenDef | eclassDef )+ ( Action | tokenDef | eclassDef )* Eof ; where Action is a user action (C code), tokenDef is a #token defini- tion and eclassDef is an #errclass definition. Lexical actions found in a language description are passed on to the parser.dlg file. The ( )* and ( )+ EBNF operators indicate 0 or more and 1 or more respec- tively. { } implies that the items within are optional. Rule grammar indicates that a PCCTS file has a #header action first (if at all) followed by actions, token definitions and error class definitions with the except that actions may not appear between rules. Also, at least one rule must be defined. There is no start symbol specifica- tion since any rule can be invoked first. The minimum required user C code is a main program that initial- izes the PCCTS-generated parser and calls the starting rule. The lex- ical analyzer is derived automatically (from the user's language description) by ANTLR and a user routine is not necessary to supply tokens to the parser. The main program does not need to be located in a PCCTS grammar file. ANTLR descriptions may be broken up into many different files, but the sequence mentioned above in grammar must be maintained. For example, if file f1.g contained #header <<#include "int.h">> << main() { ANTLR(start(), stdin); } >> and file f2.g contained start : "begin" VAR "=" NUM ";" "end" "." "@" ; Page 4 PCCTS 1.00 and file f3.g contained #token VAR "[a-z]+" #token NUM "[0-9]+" the correct ANTLR invocation would be antlr f1.g f2.g f3.g Note that the order of files f2.g and f3.g could be switched because a rule is either a #token or an actual rule definition. In this case, to comply with ANTLR's grammar, the only restriction is that file f1.g must be mentioned first on the command line. Other files may be included into the parser files generated by ANTLR via actions containing a #include directive. For example, <<#include "support_code.h">> If a file must be included in all parser files generated by ANTLR, the #include directive must be placed in the #header action. In other words, #header <<#include "necessary_type_defs_for_all_files.h">> Note that #include's can be used to define any ANTLR object (Attrib, AST, etc...) by placing it in the #header action. 1.4. Makefile template The template presented in this section can be used to construct makefiles for single-file grammars. The order of execution will fol- low that illustrated in section 1.2. Page 5 PCCTS 1.00 # # Makefile for a single-file grammar # ANTLR_H= path to antlr.h, dlgdef.h, dlgauto.h, standard attribute defs CFLAGS= -I$(ANTLR_H) -DLL_K = $(K) AFLAGS= GRM=user_grammar_file_name SRC=scan.c $(GRM).c err.c OBJ=scan.o $(GRM).o err.o K=number_of_tokens_of_lookahead # build an executable with same name as grammar file minus .g proj: $(OBJ) $(SRC) cc -o $(GRM) $(OBJ) # how to build parser files from grammar files $(GRM).c parser.dlg : $(GRM).g antlr $(AFLAGS) $(GRM).g # how to build scanner from lex description built by ANTLR scan.c : parser.dlg dlg -C2 parser.dlg scan.c 1.5. PCCTS component command-line options The PCCTS system is composed of ANTLR and DLG both of which have command-line options. Since each component is executed independently, each may be invoked with different options which are summarized in this section. ANTLR accepts the following command-line options -cr Generate cross reference of all rules. The default is FALSE. -e1 Only the first 3 tokens of any ambiguity set are displayed upon warning/error. i.e. { A B C ...}. -e2 Every token in an ambiguity set is displayed upon warning/error. -e3 Error levels 1 and 2 (-e1, -e2) are sometimes misleading for the sake of brevity. For instance, when a block is ambiguous upon the sequences A B C D Page 6 PCCTS 1.00 only { A C }, { B D } is displayed when in fact A D and C B are not ambiguous. The -e3 option makes ANTLR generate permutation trees in LISP-like notation (root child1 child2 ... childn) The following grammar illustrates the difference a : (A B|C D|A D|C B) | (A D|C B) ; ANTLR with error level 1 or 2 reports Antlr parser generator Version 1.00 1989, 1990, 1991 "t.g", line 1: warning: alts 1 and 2 rule ambiguous upon { A C }, { B D } ANTLR with error level 3 reports Antlr parser generator Version 1.00 1989, 1990, 1991 "t.g", line 4: warning: alts 1 and 2 rule ambiguous upon ( C B ) ( A D ) which indicates that rule a is only ambiguous upon the two sequences C B and A D. -fe fname Rename the file where error information, token sets, and zzto- kens[] are placed. Default is err.c. -fl fname Rename the file where ANTLR places the lexical description. Default is parser.dlg. Page 7 PCCTS 1.00 -ft fname Rename the file where ANTLR places the #define's for labeled con- stants and the function prototypes. Default is tokens.h. -ga Generate ANSI prototypes for functions constructed from rules. Default is not to generate ANSI-style prototypes. -gt Generate code for Abstract-Syntax-Trees. Default is not to gen- erate trees. -gc Do not generate output parser code or lexical description (implies -gx). This option can be used to check a grammar for ambiguities. Default is to generate parsers. -ge Generate an error class for each non-terminal of the form #errclass Rule { rule } Default is not to generate the classes. -gs When a token in the current look-ahead needs to be compared against two or more tokens, code for set membership is generated rather than a set of integer comparisons. This renders the parsers more efficient, but much more difficult to read for a human. When debugging an ANTLR parser with a source level debugger, this command-line option should be used to create more readable parsers. The default is to generate sets for any token expression list with two or more members. -gd Generate debugging code; trace rule invocation. This command- line option forces ANTLR to generate code that calls zzTRACE() with the rule name in which it appears as an argument. zzTRACE() can be a macro or a function. The default is not to generate debugging code. -gl Generate line information in PCCTS-generated parser that associ- ates a line in the user's grammar with a line in the C code. This is useful when the C compiler reports an error in a user action. It will report an error in the grammar file rather than an error in the C file. This dumps line information like the C preprocessor. -gx Do not generate a lexical description (dlg-related files). This is useful if you are positive nothing has changed in the lexical portion of your language description. For example, if you simply change an action in one of your rules, nothing has changed lexi- cally and you need not run DLG to recreate the lexical analyzer. The default is to generate a description. Page 8 PCCTS 1.00 -k n Set k, the number of look-ahead tokens, to n. Default is 1. -p Print out the grammar to stdout without actions. This option is useful for documentation purposes or to simply make viewing/understanding your grammar easier. The default is not to generate a grammar printout. -pa This option is identical to the -p option accept that the grammar is annotated with the results of grammar analysis. For example, the set of all tokens that can be matched at k tokens into the future are displayed after each point in your grammar where a decision has to be made. ANTLR prints out only the sets of tokens that would be needed to construct a parser. Rule a below illustrates the annotations. #header <<;>> a : (A B | A D) | Q ; Executing ANTLR with antlr -k 2 -pa t.g yields (cleaned up slightly) Antlr parser generator Version 1.00 1989, 1990, 1991 a : ( A B /* [1] { A }, { B } */ | A D /* [2] { A }, { D } */ ) /* [1] { A } */ | Q /* [2] { Q } */ ; The first alternative of rule a has a subrule with two alterna- tives which requires a parsing decision. Since both alternatives start with an A, another token of look-ahead must be examined to determine which alternative to choose. In this case, the second Page 9 PCCTS 1.00 token in each alternative allows us to choose. Therefore, two sets of tokens (singleton sets here) are printed out for each alternative in the subrule. The alternatives of rule a can be matched with the knowledge of only the first token and so only one set is printed. The default is not to generate a grammar printout with annotations. DLG has the following command-line options. -Cn Specify level of character class compression. 0 means no compression, 1 removes unused characters from DFA states, and 2 specifies that DLG is to combine characters into equivalence classes. It is suggested that -C2 be used since it will result in much smaller output file and does not require much additional processing time. -ga Produce ANSI C compatible code. -m filename Uses filename rather than mode.h for the lexical mode definition file. -i Attempts to build a lexical scanner for interactive programs. The non-interactive scanner always looks one character past the end of a token. This look-ahead is avoided whenever possible in the interactive version so that the pattern is recognized without any additional characters being read. Look-ahead is always used when necessary to correctly match input tokens. 1.6. Introductory example As a simple concrete example of the PCCTS system, consider the following translator which accepts a date in the typical American for- mat of mm/dd/yy and prints to stdout the same date, but in the Euro- pean format of dd/mm/yy. Page 10 PCCTS 1.00 #header <<#include "int.h">> << main() { ANTLR(date(), stdin); } >> #token "[\t\ \n]" << zzskip(); >> /* Ignore White */ #token Digits2 "[0-9][0-9]" /* Define mm, dd or yy */ date : Digits2 "/" Digits2 "/" Digits2 "@" <<printf("%02d/%02d/%02d\n", $3, $1, $5);>> ; This example is complete in the sense that PCCTS can generate all C files necessary to lexically and syntactically analyze a date and con- vert it to the European format. An executable can be obtained by per- forming the following sequence of commands (assuming the above grammar is in file date.g and antlr.h, the standard ANTLR header, is in the current directory). antlr date.g dlg parser.dlg scan.c cc date.c scan.c err.c Executing the file a.out (on UNIX [Trademark AT&T] systems) will allow the user to type in a date and have the European version printed out (after end of file is typed). Rule date is defined to be a sequence of two digits followed by a / followed by a sequence of 2 digits, and so on, terminated by the end of the input file (represented by "@"). Note that rule names begin with a lower-case letter. The action <<printf("%02d/%02d/%02d\n", $3, $1, $5);>> Page 11 PCCTS 1.00 contains C code that is executed after all elements in rule date have been recognized. C code for the action is delimited by << and >> (European quotes). Because the >> delimiter otherwise would be con- fused with C's right-shift operator within the C code of an action, right-shift is written as \>\>. Similarly, the $ and # characters normally have special meanings within the C code of a PCCTS action; hence, the literal character $ in the C code would be written as \$ and # would be written as \#. Although the placement of an action within a grammar determines when that action will be applied, it must also be possible for the action to be a function of the text which has been recognized. For example, PCCTS automatically makes $1 in rule date have a value, or attribute, which is a function of the text matched by the first occurrence of Digits2. In general, the attribute associated with the ith item in a rule is called $i. Both the type of the attributes and the mapping from text matched into its attribute value is specified by the user. The directive #header <<#include "int.h">> includes code that makes the values of "$ variables" be of the C type int with values defined by applying the C function atoi() to the text matched. Each item in the rules refers either to the text matched by a simple pattern (a regular expression) or to the collection of items matched by a rule (represented by the rule's name). Simple patterns can be stated directly, such as "/" in the example, or can be given names using the #token directive. The token definitions #token "[\t\ \n]" << zzskip(); >> /* Ignore White */ #token Digits2 "[0-9][0-9]" /* Define mm, dd or yy */ cause tab (\t), space (\ ), and newline (\n) characters between the other tokens to be ignored, and allow use of the name Digits2 as a shorthand for "[0-9][0-9]". Since the above is meant to be a stand-alone complete example and PCCTS does not generate a C main function, the user must provide a main. The main also specifies which rule is to be used to recognize the input and where that input will come from. The code Page 12 PCCTS 1.00 ANTLR(date(), stdin); specifies that the date rule will be used to recognize input taken from stdin (the standard input stream). Notice that, although there is no specification of how incorrect input should be handled by the generated program, PCCTS will automati- cally generate appropriate error messages and attempt to recover from the errors. The above example should give the reader a "feel" for how PCCTS can be used, however, the system supports much more than the example uses. This document is organized into six major sections: lexical analysis, syntactic analysis, attribute handling, error detection, miscellaneous items and PCCTS history. 2. Lexical analysis and token definition The task of language recognition is conveniently broken down into two phases - lexical and syntactic analysis. Each phase is considered a separate operation and, therefore, most language tools require separate specifications. For example, Coco-2 [DoP90] requires the user to create separate, but consistent, specifications of character sets, keywords and tokens (and even non-terminals). In contrast, PCCTS derives all this information from a single description. This both ensures that the information is consistent and makes it easier to view a PCCTS language specification as a whole. This section of the manual describes general lexical issues asso- ciated with PCCTS' integrated description including lexical action definitions, regular expression specification, and resolution of ambi- guities. Also, more specialized issues such as multiple automatons, interactive scanners, token positional information and input character supply are presented. 2.1. Token Labeling and token actions Tokens are defined either explicitly with #token or implicitly by using them as rule elements. Implicitly defined tokens can be either regular expressions (non-labeled tokens) or token names (labeled). Token names begin with an upper-case letter (rules begin with a lower-case letter). More than one occurrence of the same regular expression in a grammar description produces a single regular expres- sion in parser.dlg and is assigned one token number. Quoted and non- quoted tokens that refer to the same lexical object (expression) may be used interchangeably. Token names that are referenced, but not attached to a regular expression are assigned a number and given a #define definition in tokens.h. It is not necessary to label any Page 13 PCCTS 1.00 tokens in PCCTS. However, all tokens that you wish to explicitly refer to in an action, must be declared with a #token instruction. From PCCTS's point of view, it merely needs to know that a word is a token (as opposed to a non-terminal). Its value is irrelevant at the symbolic level of PCCTS's meta-language. The user may introduce tokens, lexical/token actions, and token labels via the #token instruction. Specifically, o Simply declare a token for use in a user action: #token VAR o Associate a token with a regular expression and, optionally, an action: #token VAR "[a-zA-Z][a-zA-Z0-9]*" #token Eof "@" << printf("Eof Found\n"); >> o Specify what must occur upon a regular expression: #token "[0-9]+" << printf("Found int %s\n", zzlextext); >> Token actions may employ two functions, zzreplchar() and zzreplstr(), to replace the text for the regular expression matched on the input stream. For example, if \n is found in a string for C, it needs to be converted to the actual newline character. Strings in C can be handled in the following manner. Page 14 PCCTS 1.00 /* START lexclass is in effect */ #token "\"" <<zzmode(STRINGS); zzmore();>> #lexclass STRINGS /* define a new automaton */ #token STRING "\"" <<zzmode(START);>> #token "\\\"" <<zzmore();>> #token "\\n" <<zzreplchar('\n'); zzmore();>> #token "\\r" <<zzreplchar('\r'); zzmore();>> #token "\\t" <<zzreplchar('\t'); zzmore();>> #token "\\[1-9][0-9]*" <<zzreplchar((char)strtol(zzbegexpr,NULL,10)); zzmore();>> #token "\\0[0-7]*" <<zzreplchar((char)strtol(zzbegexpr,NULL,8)); zzmore();>> #token "\\0x[0-9a-fA-F]+" <<zzreplchar((char)strtol(zzbegexpr,NULL,16)); zzmore();>> #token "\\~[\n\r]" <<zzmore();>> #token "[\n\r]" <<zzline++; zzmore(); /* print warning about \n in str */>> #token "~[\"\n\r\\]+" <<zzmore();>> #lexclass START Note the use of zzbegexpr which points to the beginning of the text matched for the currently recognized regular expression. zzendexpr is also available and points to the end of the expression. The text always has a '\0' on the end and zzendexpr points to the character just before. There is a big difference between a token action and a normal action found outside of the #token instructions. An action not included as part of one of the lexical commands is considered a syn- tactic action and is placed in the C parser file associated with that grammar file. If a grammar action follows a #token definition with no associ- ated lexical action, the #token instruction will assume that the action is really meant as a lexical action. This is sort of analogous to the dangling-else-clause dilemma. Example: #token VAR "[a-zA-Z][a-zA-Z0-9]*" << /* * This will be associated with the #token even though it is * probably a grammar action not a lexical action */ >> Page 15 PCCTS 1.00 This ambiguity can be solved by appending a null action to the #token: #token VAR "[a-zA-Z][a-zA-Z0-9]*" << ; >> 2.2. Lexical actions via #lexaction Often, it is convenient to have a section of user C code placed in the lexical analyzer constructed automatically by DLG (PCCTS's DFA-based lexical analyzer generator). Normally, actions not associ- ated with a #token pseudo-op or embedded within a rule are placed in the parser generated by ANTLR. However, preceding an action appearing outside of any rule, with the #lexaction pseudo-op directs the C code to the lexical analyzer. For example, << /* a normal action outside of the rules */ >> #lexaction << /* this action is inserted into the lexical * analyzer created by DLG */ >> All #lexaction actions are collected and placed as a group into the C file where the lexer resides. Typically, this code consists of func- tions or variable declarations needed by #token actions. 2.3. Multiple lexical classes ANTLR parsers employ DFA's (Deterministic Finite Automatons) created by DLG to match tokens found on the character input stream. More than one automaton (lexical class) may be defined in PCCTS. Mul- tiple scanners are useful in two ways. First, more than one grammar can be described within the same PCCTS input file(s). Second, multi- ple automatons can be used to recognize tokens that seriously conflict with other regular expressions within the same lexical analyzer (e.g. comments, quoted-strings, etc...). Actions attached to regular expressions (which are executed when that expression has been matched on the input stream) may switch from one lexical analyzer to another. Each analyzer (lex class) has a label which is used to enter that automaton. A predefined lexical class called START is in effect from the beginning of the PCCTS description until the user issues a #lexclass directive or the end of Page 16 PCCTS 1.00 the description is found. When more than one lexical class is defined, it is possible to have the same regular expression and the same token label defined in multiple automatons. Regular expressions found in more than one auto- maton are given different token numbers, but token labels are unique across lexical class boundaries. For instance, #lexclass A #token LABEL "expr1" #lexclass B #token LABEL "expr2" In this case, LABEL is the same token number (#define in C) for both "expr1" and "expr2". A reference to LABEL within a rule can be matched by two different regular expressions depending on which auto- maton is currently active. Hence, the #lexclass directive marks the start of a new set of lexical definitions. Rules found after a #lexclass can only use tokens defined within that class - i.e. all tokens defined until the next #lexclass or the end of the PCCTS description, whichever comes first. Any regular expressions used explicity in these rules are placed into the current lexical class. Since the default automaton, START, is active upon parser startup, the start rule must be defined within the boundaries of the START automaton. Typically, a multiple- automaton PCCTS grammar will begin with #lexclass START immediately before the rule definitions to insure that the rules use the token definitions in the "main" automaton. Tokens are given sequential token numbers across all lexical classes so that no conflicts arise. This also allows the user to reference zztokens[token_num] (which is a string representing the label or regular expression defined in the grammar) regardless of which class token_num is defined in. 2.3.1. Multiple grammars, multiple lexical analyzers Different grammars will generally require separate lexical analyzers to break up the input stream into tokens. What may be a Page 17 PCCTS 1.00 keyword in one language, may be a simple variable in another. The #lexclass PCCTS meta-op is used to group tokens into different lexical analyzers. For example, to separate two grammars into two lexical classes, #lexclass GRAMMAR1 rules for grammar1 #lexclass GRAMMAR2 rules for grammar2 All tokens found beyond the #lexclass meta-op will be considered of that class. 2.3.2. Single grammar, multiple lexical analyzers Again the #lexclass meta-op is used to indicate the beginning of a new lexical analyzer. For instance to define C-style comments, the following will suffice: /* lexclass START is in effect by default */ /* switch to COMMENT automaton */ #token "/\*" <<zzmode(COMMENT); zzskip();>> ... #lexclass COMMENT #token "\*/" <<zzmode(START); zzskip();>> /* switch back */ #token "~[\*]*" <<zzskip();>> /* ignore all stuff inside */ #token "\*~[/]" <<zzskip();>> ... 2.4. Handling end of input There is only one automatically defined regular expression - end-of-file which is represented by the regular expression @. An implicit #token "@" Page 18 PCCTS 1.00 is executed at the start of each lexical class (#lexclass). A label may be attached via the following in the user's grammar: #token Eof "@" A user could then refer to end-of-file as the #define constant Eof. 2.5. Token order and lexical ambiguities The order in which regular expressions are found in the grammar description file(s) is significant. When the input stream contains a sequence of characters that match more than one regular expression, (e.g. one regular expression is a subset of another) DLG is confronted with a dilemma. DLG does not know which regular expression to match, hence, it does not know which action should be performed. To resolve the ambiguity, DLG assumes that the regular expression which is defined earliest in the DLG input should take precedence over later definitions. Therefore, tokens that are special cases of other regu- lar expressions should be defined before the more general regular expressions. ANTLR automatically constructs the input to DLG by copying regu- lar expressions into the file parser.dlg. These definitions are copied in the order in which they are encountered among the rules and #token definitions. For example, a keyword is a special case of a variable and thus needs to occur before the variable definition. #token KBegin "begin" . . . #token VAR "[a-zA-Z][a-zA-Z0-9]*" 2.6. Quoted tokens The regular expressions in a PCCTS grammar appear between quote marks and are literally copied into the input for DLG. Therefore, the regular expressions must use the notation accepted by DLG - a rela- tively standard notation using a few meta-symbols. "a|b"Matches either the pattern a or the pattern b. "(a)"Matches the pattern a. Pattern a is kept as an indivisible unit. Page 19 PCCTS 1.00 "{a}"Matches a or nothing, i.e., the same as "(a|)". "[a]"Matches any single character in character list a. e.g. "[abc]" matches either an a, b or c and is equivalent to "(a|b|c)". "[a-b]" Matches any of the single characters whose ASCII codes are between a and b inclusively, i.e., the same as "(a|... |b)". "~[a]" Matches any single character except for those in character list a. "~[]" Matches any single character; literally "not nothing." "a*" Matches zero or more occurrences of pattern a. "a+" Matches one or more occurrences of pattern a, i.e., the same as "aa*". "\a" Matches the single character a - even if a by itself would have a different meaning, e.g., "\+" would match the + character. This is consistent with the use of \ in the C language. In addition, the symbols %%, <<, and >> are used to specify control for DLG; hence, these patterns must be escaped if they are to appear as literals within a regular expression. For example, << would be \<\<. The following characters are also special. @ Matches end-of-file. \t Tab character \n Newline character \r Carriage return character \b Backspace character \0nnnMatches character that has octal value nnn \0xnnMatches character that has hexadecimal value nnn \mnn Matches character with decimal value mnn, 1<m<9 \c Character escape White space characters (tab and space) are ignored within the quota- tion marks. To specify the actual space character use "\ ". Page 20 PCCTS 1.00 2.7. Interactive lexical analyzers Normally, the scanners produced by DLG always have one character of look-ahead available. This look-ahead causes the scanner to look one character past the end of the current pattern and wait for the next character if it is not yet available. This can cause unexpected behavior from interactive programs such as command line interpreters which must respond to input upon newline. Normally, the scanner would seek a character beyond the newline; forc- ing the user to type an additional character before returning the car- riage return token to the parser. The user may employ the DLG -i command-line option to request that unneeded look-ahead be removed whenever possible. This allows interactive parsers such as the interpreter mentioned above to get tokens without the user having to type additional characters. On some systems special flags need to be set, so that the scanner actually gets the characters when they are received rather than having the operating system buffer them. On BSD unix the cbreak mode needs to be used to accomplish this (e.g. stty cbreak). There are situations where the character look-ahead is still required. Typically, this occurs when two patterns have the same pre- fix. For example, patterns A and Ab can only be distinguished by exa- mining the character following pattern A. A b character would indi- cate that the token associated with the second pattern should be returned. Also, any pattern that has a closure, + or *, must have look-ahead to determine the end of the pattern, since the pattern by definition can be repeated indefinitely. 2.8. DLG lexical input The user can specify that DLG is to take its input from a stream or from a function. The function must behave like getchar() by returning a new character per invocation. The function must return -1 when no more characters are available to signal "end-of-file." Presumably, the function would read characters from some buffer and hand them one by one to DLG upon request. For example, a user might define the following. char text_edit_buf[BIG]; int nextchar() { static char *p = &text_edit_buf[0]; return (p < &text_edit_buf[BIG]) ? *(p++) : -1; } Page 21 PCCTS 1.00 Then, when invoking the parser, the user would call: { ... ANTLRf(start_symbol(), nextchar); ... } 2.9. Tracking pattern position There are several global variables available in the scanners pro- duced by DLG to keep track of the position of a pattern within an input stream. zzline tracks vertical position and must be explicitly updated by the user's grammar. zzbegcol and zzendcol track horizontal position-column number within a line. They are maintained by the DLG automaton for characters that are normally one character wide. zzendcol needs to be corrected for cases when the character is not one character wide, e.g. tabs or carriage return. To enable the position tracking, a #define ZZCOL needs to be put in an action in the lexical specification or -DZZCOL needs to be included in the command line when compiling the C code for the lexical analyzer. This section described how a stream of input characters can be collected into a sequence of tokens and how one can act upon the recognition of particular tokens. Sentences in a language are com- posed of tokens and are described using a set of rules. The next sec- tion presents the PCCTS meta-language for describing language syntax. 3. Syntactic analysis Once the lexical structure of a language has been specified, the way in which tokens can be combined to form sentences must be described. The set of valid sentences in a language is described by a set of rules, written in an appropriate meta-language, that impose a structure on the sequence of input tokens. In most parser-generator tools, context-free grammars are speci- fied in a notation which is equivalent to BNF [Nau63]. However, the notation used in regular expressions to permit repetition, alterna- tives, etc. can be used to extend BNF (EBNF) so that context-free grammars can be expressed more concisely. This also has the advantage of being more consistent with the regular expression notation and pro- viding additional information which can be used to create a more effi- cient parser. For these reasons, PCCTS accepts an EBNF-like grammar notation. The input for PCCTS differs from EBNF primarily in that ::= is Page 22 PCCTS 1.00 replaced by : and non-terminals are distinguished from terminals by requiring all non-terminals to begin with a lowercase letter, rather than by bracketing non-terminals with <>. Further, since rules can become long when actions are embedded, each rule is terminated by a ; symbol. These modifications are not arbitrary, but mirror the differ- ences between YACC grammars and BNF (although PCCTS provides a number of extensions beyond YACC). As an example rule, consider: block: "begin" (statement)* "end" ; It indicates that rule a is defined to be the word "begin", followed by zero or more statement's, followed by the word "end". This section describes the syntax of PCCTS rules and the method by which rules communicate at run-time. 3.1. PCCTS rule definitions A rule is formally defined, in PCCTS's own notation, as: rule : NonTerminal /* rule name */ { "!" } /* turn off automatic AST construction */ { {"\<"} InheritAction } /* downward-inheritance ("input") */ { "\>" InheritAction } /* upward-inheritance ("output") */ { QuotedTerm } /* name for error reporting */ ":" block ";" /* actual rule elements */ { Action } /* fail action */ ; block : alt ("\|" alt)* ; alt : ( element )* ; element : TokenTerm { "^" | "!" } | QuotedTerm { "^" | "!" } | NonTerminal { "!" } { {"\<"} InheritAction } { "\>" InheritAction } | Action | "$" block "$" { "\*" | "\+" } { InheritAction } | "\{" block "\}" { InheritAction } ; Page 23 PCCTS 1.00 where InheritAction is a C expression enclosed in [ ] used for both upward- and downward-inheritance. QuotedTerm is a regular expression and TokenTerm is a label associated with some lexeme. 3.1.1. Subrules Rules that employ EBNF constructs implicitly define subrules according to the following syntax: (i) b =(a | a | ... | a ); match 1 time i 1 2 m [Figure omitted] (ii) b ={a | a | ... | a }; match 0 or 1 times i 1 2 m [Figure omitted] (iii)b =(a | a | ... | a )*; match 0 or more times i 1 2 m [Figure omitted] (iv) b =(a | a | ... | a )+; match 1 or more times i 1 2 m [Figure omitted] where a represents an alternative (list of rule elements). i With each enclosing set of meta-symbols, a new level of alterna- tives is permitted. Each rule or subrule thus created is called a "block." In terms of the recognizer generated, blocks generated for rules and subrules are virtually indistinguishable. Subrules are like macro expansions of rules. However, there is no label associated with a subrule and thus cannot be referred to by another rule or from else- where within that rule. Consider the following rule which uses EBNF constructs: e12 : {"\+" | "\-"} VAR | (FuncName | ProcName) parameterList | "[0-9]+" ; Rule e12 has 3 alternatives. The first alternative has an optional block with 2 alternatives as its first element. The { } is an optional subrule. VAR is a simple token. The optional block says that a VAR can have a plus or minus in front, but it is not necessary. The second alternative of rule a also begins with a subrule. The block indicates that this alternative must begin with a function or Page 24 PCCTS 1.00 procedure name. parameterList is some other rule which presumably looks for a list of parameters. Alternative 3 is simply a token which recognizes an integer number. Subrule meta-symbols alter the associativity of the | alternative operator and can be nested to any number of levels. |'s have the lowest precedence. a: B | C D ; Rule a specifies that the parser should match either B or C D. It will not match a D preceded by a B or C. The following version of a will accomplish that: a: (B | C) D; Empty alternatives called epsilon transfers can also be used to imply optionality. They are useful when you want an action to be per- formed when nothing is matched by the other alternatives of the block. a: B | C | ; b: { B | C } ; Rules a and b are identical in what they match. However, the resulting C code could be very different. Also note that with rule a, an action can placed in the empty alternative to be executed when a B or C is not matched. Rule b does not have this capability. Subrules are allowed at most one parameter which must be of type Attrib. The attribute $0 receives the inherited parameter for subrules. Subrules return values in $0 as well (which become the appropriate $i in the enclosing subrule/scope above). $0 is initially undefined in rules and in subrules with no explicit parameter or default $0 initialization. See section 3.1.2. 3.1.2. Rule communication An advantage of LL parsers over LR parsers is that inheritance (rule/subrule communication) is relatively simple and efficient. Page 25 PCCTS 1.00 PCCTS allows each non-terminal to have an arbitrary set of downward- inherited (parameters) and upward-inherited (return value) attributes (see section 4.1.6). The specification syntax is presented in this section whereas the use of rule communication is illustrated in sec- tion 4.1.6.2: rule[parameters] > [return_type(s)] : X | X | ... | X ; 1 2 m which provide more powerful attribute handling than can be achieved with the "$ variables" found in systems like YACC (and also supported in PCCTS, primarily to ease the transition for users familiar with YACC and to communicate with the lexical analyzer). Communicating with other rules should be as flexible as it is in a programming language. Hence, in addition to the standard attribute mechanism, PCCTS supports the familiar high-level language parameter passing scheme. The arguments to a rule must be consistent with the parameter specifications given with the definition of the rule. For example, << #define GLOBAL 1 #define LOCAL 2 >> prog : <<Sym *vars;>> TYPE dlist[GLOBAL] > [vars] <<OperateOn(vars);>> ... ; /* Recognize a list of declarations. The scoping level is passed * in and the list of symbols is returned. */ dlist[int level] > [Sym *list] : <<$list=NULL;>> VAR <<add($list, $1, $level);>> ( "," VAR <<add($list, $2, $level);>> )* ; where add() is some function that adds an element to a list. Here we have used the attributes only as a means of communicating with the Page 26 PCCTS 1.00 lexical analyzer. In contrast to most programming languages, more than one value can be returned. Each return value is set by simply referring to it by the name given in the return value definition (prefixed with $) - e.g. $list in the dlist example. Subrules may not have return types nor parameter list defini- tions. They may only pass a single attribute which becomes the ini- tial value of $0 in the subrule.[1] To illustrate the effect of rule communications upon local vari- ables and scoping, consider: rule[int a, Attrib b] > [float q, int r] /* a,b visible to rule */ : <<int i,j; printf("starting rule a\n");>> /* i,j visible to rule */ ... ( <<int i;>> /* i hides rule-local i */ ... ( <<int i,r;>> /* def hides outer scope i; local r can't be used */ <<$r=5;>> /* $r is return value not local var */ ) ) ; Note that return values and parameters are always given priority over local variables. While generating C code for a rule, ANTLR scans the actions and translates references to return values and parameters irrespective of scope. 3.1.3. Miscellaneous notes The user may attach a string to a non-terminal that is passed on to the error reporting macro/function zzsyn(). See section 5 on error reporting for more details. The user may elect to have PCCTS automatically construct abstract-syntax-trees (AST's). To disable automatic AST construction for a rule, employ the ! as indicated above in the rule grammar. See section 4.2 on abstract-syntax-trees. _________________________ [1] An earlier version of PCCTS treated subrules and rules identically with respect to parameters and return values. The current distinction represents an extension to the handling of rules. Page 27 PCCTS 1.00 3.1.4. LL(k) parsing ANTLR attempts to build parsers that use as little look-ahead as possible. Ambiguities force ANTLR to attempt resolution using more and more tokens of look-ahead. For example, ANTLR tries to resolve LL(1) ambiguities with LL(2). If the grammar construct is still ambi- guous, LL(k>2) is tried until either all ambiguities are resolved or the user-defined maximum k has been reached. A subrule or rule which requires LL(k>1) does not constrain all others to LL(k>1). ANTLR builds mixed model parsers where each parsing decision is made with the smallest of amount of look-ahead possible. The -k command-line option is used to bound the maximum number of tokens of look-ahead to be used by ANTLR-generated parsers. Most programming language constructs that cannot be described in LL(1) can be easily described in LL(2). Typically, LL(2) is only required in a handful of rules yielding a general LL(1) parser with a few anomalies. For example, when parsing the C programming language, a left-parenthesis could be the beginning of a type-cast or the begin- ning of an expression. With two tokens of look-ahead a parser could use the token following the parenthesis to determine whether an expression or type-cast followed. A more common problem lies in the definition of labels. The following grammar fragment illustrates the problem. stat : { WORD ":" } /* label definition */ ( | "if" ... | "while" ... . . | WORD "\(" ... /* forward ref to function */ | FUNC "\(" ... /* ref to function already seen */ ) ; With only one token of look-ahead, function stat() could not determine whether a WORD on the input stream was to be matched as a label or a yet to be seen function. However, if two tokens or more were avail- able, the colon or left-parenthesis following WORD would unambiguously indicate which alternative to match. 3.2. Grammar ambiguities PCCTS parsers are of the top-down, recursive descent variety which implies that all parsing decisions must be left-decidable- decidable given a finite sequence of input tokens looking from left to right across productions. Because of this limitation, there are gram- mars for which PCCTS cannot construct a parser. However, any language Page 28 PCCTS 1.00 can be described by several, equivalent grammars. Ambiguities that arise in a grammar may often be circumvented by modifying that grammar while still recognizing the same language. A grammar is considered ambiguous (in PCCTS's opinion) if one input sequence of k tokens can be matched by two different productions/alternatives within the same block. For example, if k equals one, a : A B | A C ; is ambiguous because if all the parser "sees" is A, it will be unable to choose between alternatives one and two. ANTLR reports the follow- ing error message: Antlr parser generator Version 1.00 1989, 1990, 1991 "t.g", line 1: warning: alts 1 and 2 ambiguous upon { A } If we allow two tokens of look-ahead (with -k 2), a parser would see either A B or A C and could uniquely determine which production to match. If subrules and/or rule references are used, the set of possi- ble input permutations grows quickly. a : A (B | D) E F H | A b E G H ; b : B ; Rule a is ambiguous when k equals one or two since A B can start either production. If we increase k to three, rule a is still ambigu- ous because A B E begins both productions. Only when k is set to four is a unambiguous. ANTLR applies a similar logic when constructing parsers. When an ambiguity occurs at k==i, k=i+1 is tried until either the ambiguity is resolved or the maximum look-ahead defined by the user is reached (in which case the ambiguity is reported to the Page 29 PCCTS 1.00 user). ANTLR reports the following for k==2, Antlr parser generator Version 1.00 1989, 1990, 1991 "t.g", line 1: warning: alts 1 and 2 ambiguous upon { A }, { B } Note the sequence A B is denoted by the singleton set { A } followed by the singleton set { B }. In general, ambiguous input sequences of k tokens are represented by a sequence of k sets. The -pa command-line option can be used to print out the user's grammar (minus actions) annotated with the input token permutations. The set sequence { A B }, { C D } means that the following four input token permutations are possible: A C A D B C B D 3.3. Look-ahead size The user's grammar can be analyzed using any integer k value (limited by machine size and how long the user wants to wait). How- ever, parsers constructed by ANTLR always use k values equal to the nearest power of two greater than or equal to the k specified by the user. ANTLR parsers treat the look-ahead as a circular buffer in which a "modulo k" operation is performed repeatedly. If k is a power of two, this operation reduces to a logical-and operation whereas other values generally require a divide - i.e. a time consuming opera- tion. Since the modulo operation is performed at least once for each input token, a divide operation would seriously degrade parser perfor- mance. The #define constant LL_K is set by ANTLR to the k associated with parser execution. Note that files not constructed by ANTLR will need to define LL_K if they include antlr.h. This can be handled with a "-DLL_K=k" command-line option on most C compilers. 3.4. ANTLR parser construction ANTLR generates parsers in pure C code--i.e. non-interpretive, mutually-recursive sets of functions. There is exactly one C function for every rule present in an ANTLR grammar description. Transforma- tions exist to convert grammar constructs to C language constructs. To perform lexical analysis, ANTLR prepares a DLG input file that will be converted to a C scanner by DLG and then linked into the parser. Page 30 PCCTS 1.00 Because of ANTLR's flexibility, C code generation templates vary depending on command-line options and #define's [Par90]. ANTLR parsers are fast primarily because they never have to back up and map EBNF grammar constructs to efficient target language constructs. The current k tokens of look-ahead always tells the parser which alterna- tive to choose. Also, C's efficient function invocation mechanism allows references to other rules to be handled quickly. 3.4.1. Efficiency The way grammars are described has a big impact on phrase recog- nition speed. Rules invoking other rules with tail-recursion (right- recursion) like /* S l o w A n d U g l y */ a: B c ; c: a | ; can be replaced with /* F a s t e r B u t S t i l l U g l y */ a: B (a | ) ; or, better yet, /* F a s t e s t A n d R e a d a b l e */ a: B (B)* ; which is more concise and infinitely more readable. This example leads us to two important generalizations: o Subrules are much faster than invoking another rule to match the same subgrammar. o Use the ( )* and ( )+ closure operators instead of tail recur- sion. This generates a while loop instead of recursive function Page 31 PCCTS 1.00 calls. Rules or subrules with many alternatives tend to generate bigger and slower programs. This arises from the fact that a linear search implemented by a sequence of if statements is used to determine which alternative to choose. Although general LL(k) parsing makes it diffi- cult to optimize this search for the correct alternative, future ver- sions of PCCTS will probably make a number of special-case improve- ments, such as generating a switch statement for each alternative selection problem which is LL(1). In many parser-generators, such as YACC, adding actions to a grammar can cause rules to be split, introducing ambiguities and/or extra states in the generated parser. In contrast, actions have no affect on the structure of the recognizers constructed by PCCTS. PCCTS simply generates C code templates that implement the recognizer and embeds C code for the user-supplied actions at the proper points. User actions have no significant affect on parsing speed. 3.4.2. Debugging parsers PCCTS parsers have one function for each rule. The function name will coincide with the rule it was constructed from and therefore can be referenced from within a source-level debugger. Breakpoints can be set so that parser execution stops before or after certain grammar fragments of interest have been recognized. PCCTS parsers are easy to follow because parsing decisions are made using simple if and while statements. Often it is convenient to track the sequence of rule invocations made in an PCCTS parser. The -gd command-line option forces PCCTS to generate code that calls zzTRACE() with the rule name in which it appears as an argument. zzTRACE() can be a macro or a function. 3.5. PCCTS parser template The following code template can be used to begin parser develop- ment. Page 32 PCCTS 1.00 /* * A N T L R P a r s e r T e m p l a t e * * Terence Parr, Hank Dietz and Will Cohen * CARP Research Group * Purdue University Electrical Engineering * PCCTS Version 1.00 */ #header << /* Define Attrib/AST or include standard package */ >> /* O p t i o n a l */ #token "[\t\ ]+" << zzskip(); >> /* Ignore White */ #token "\n" << zzline++; zzskip(); >> /* Track Line # */ << main() { ANTLR(startSymbol(), stdin); } zzcr_attr(attr, token, text) Attrib *attr; int token; char *text; { /* *attr = Function(text, token) */ } zzd_attr(attr) Attrib *attr; { /* destroy attribute */ } zzcr_ast(ast, attr, tok, text) AST *ast; Attrib *attr; int tok; char *text; { /* *ast = Function(attr) */ } Page 33 PCCTS 1.00 /* node constructor for use with #[] */ AST zzmk_ast(node, user_args) AST *node; /* newly created node */ { /* fill in node->fields */ return node; } zzd_ast(ast) AST *ast; { /* destroy AST node pointed to by 'ast' */ } #define zzdef0(zero) /* *(zero) = initial $0 attrib */ >> /* P u t R u l e s H e r e */ 3.6. Grammatical actions Actions are blocks of C code enclosed in << and >>. They are placed in the ANTLR-generated C files according to position in the corresponding grammar file. In general, an action following a token or non-terminal reference is executed after that grammar element has been recognized. There is no restriction concerning their placement within rules. However, actions cannot be specified between rules. They may only be used before and after the set of rules. 3.6.1. Init Actions If an action is specified before any grammar element is found after the : in a rule definition or after the start of a subrule, it is placed before any ANTLR generated C code for that rule or subrule. This allows the user to define variables and execute C code in a scope local to that rule or subrule; any C code given is executed just before the rule or subrule is applied. For example, consider the definition of a rule a: a : <<List *p=NULL;>> /* Get a type followed by a list of vars */ Type ( <<int i=0;>> Var <<append(p, i++, $1);>> )* <<OperateOn(p);>> ; Page 34 PCCTS 1.00 The init action <<List *p=NULL;>> defines a local auto variable called p and initializes it to NULL. The subrule also has an init action, <<int i=0;>>, which is executed every time the subrule is entered. The init actions of optional blocks are executed whether or not the elements in the block are recognized. This is due to the fact that subrules (C code block) must be entered in order to determine whether or not that grammar construct can be matched on the input stream. This can be confusing, but on the other hand, we can force a variable to have a value even if nothing is matched in an optional block. For example, a : <<Sym *p;>> ... { <<p=NULL;>> ":" Type <<p=$2;>> } <<OperateOn(p);>> ... ; The init-actions of closure blocks (( )* subrules) are executed only once - upon entry to the block. This allows local variable ini- tialization within closure blocks. Only $0 can be referenced in an init-action since nothing will have been matched when the action executes. Default $0 initialization occurs before init-actions are executed. 3.6.2. Fail actions Fail actions are actions that are placed immediately following the ; rule terminator. They are executed after a syntax error has been detected but before a message is printed and the attributes have been destroyed (optionally with zzd_attr(); see the section on attri- butes). However, attributes are not valid here because one does not know at what point the error occurred and which attributes even exist. Fail actions are often useful for cleaning up data structures or free- ing memory. For example, a : <<List *p=NULL;>> ( Var <<append(p, $1);>> )+ <<OperateOn(p); rmlist(p);>> ; <<rmlist(p);>> Page 35 PCCTS 1.00 The ( )+ loop matches a lists of variables (Var's) and collections them in a list. The fail-action <<rmlist(p);>> specifies that if and when a syntax error occurs, the elements are to be free()'ed. 3.6.3. Actions appearing outside of rules When PCCTS is used to construct a complete compiler, it is useful to be able to incorporate support code written in C in the same file. For example, there must always be a main specified and the ANTLR- generated parser must be invoked using the ANTLR macro. Hence, PCCTS allows arbitrary C code to appear as "actions" outside of the grammar rules. The PCCTS description << #include "junk" static int local_to_the_parser; main() {ANTLR(a(), stdin);} >> a: a_stuff ; b: b_stuff ; << hello() { printf("Hello?"\n"); } >> will result in the following output C code: #include "junk" static int local_to_the_parser; main() {ANTLR(a(), stdin);} a() { ... code to recognize a_stuff ... } b() { ... code to recognize b_stuff ... } hello() { printf("Hello?"\n"); } Note that although these segments of C code look like "actions," and the same rules apply concerning the use of escape characters, they Page 36 PCCTS 1.00 are not actions per se. Hence, no attribute values are accessible. 4. Attribute handling Given a grammar, PCCTS constructs a C program that recognizes phrases in the language described by that grammar. No translation from source language to output language is performed; the resulting parser merely complains if a syntax error is detected. To effect a transformation, actions must be embedded within the grammar that either perform an immediate translation or create an intermediate-form which is translated at a later time. Grammar actions have access to lexical information via run-time objects called attributes which are a function of the token number and the text of the lexeme found on the input stream. PCCTS also supports the creation and manipulation of abstract-syntax-trees (AST's) to handle more complicated transforma- tions. This section describes the use of attributes, attribute inher- itance as well as other forms of rule communication, and the handling of AST's. 4.1. General attribute handling Attributes are objects that are associated with all rules and rule elements including block elements like { }. Attributes are gen- erally identified by $i (where i is a positive integer) and are used in actions to identify the values of rule elements. Attributes behave like variables. They are mentioned and labeled at analysis-time, but have values only at run-time. The positive integers uniquely identify an element within the currently active block and within the current alternative of that block. With the invocation of each new block, a new set of attributes becomes active and the previously active set is temporarily inactive (unless $i.j, $$ or $inheritance_attr attributes are used; see sections 4.1.4). Attributes are scoped exactly like local stack-based variables in C. $-variables refer to user-defined objects associated with each rule element. The zzcr_attr() function transforms lexical objects into grammar attributes. Attributes associated with terminals are implicitly created and are a function of lexemes. However, attributes of non-terminals and subrules are explicitly manipulated by actions provided by the user. Terminals, which are lexical tokens, simply have a value associated with them and are generally referenced not reset by actions. Attributes that are returned from or passed to blocks are considered synthetic attributes. Rules and subrules may communicate via attributes or with familiar high-level language param- eter passing techniques (see sections 3.1.2 and 4.1.6). PCCTS requires that the user define the data type or structure of the attributes as well as specify how to convert from lexemes to attributes. This provides significant control over what information attributes carry and how the lexical analyzer should react to lexical objects found on the input stream. The type of the attributes is always defined by a C typedef named Attrib and must be defined before antlr.h is included in your C program. Using the PCCTS #header instruction, the user may specify the C definition or #include the Page 37 PCCTS 1.00 file needed to define Attrib. The action associated with #header is placed in every C file generated from the input grammar files. Any C file created by the user that includes antlr.h must once again define Attrib before #include'ing antlr.h. Attributes are stored and accessed in stack fashion. With the recognition of each element in a rule, a new attribute is pushed on the stack-i.e. becomes available. In other words, the ith attribute is not available until the ith rule element has been matched. If a rule element is something other than a simple token, many attributes may be pushed on the attribute stack until they reduce to a single attribute for that rule reference or subrule. For example, if attributes are to be simple integers, the follow- ing C definition is sufficient and necessary typedef int Attrib;. However, this does not specify how those integer attributes are con- verted from character strings in the input stream. 4.1.1. Attribute creation The attributes associated with terminals must be a function of the text and the token number associated with that lexeme. These values are passed to zzcr_attr() which computes the attribute to be associated with that lexeme. The user must define a function or macro that has the following form: zzcr_attr(attr, token, text) Attrib *attr; /* Pointer to attribute associated with this lexeme */ int token; char *text; /* text associated with lexeme */ { /* *attr = F(text,token); */ } Consider the following Attrib and zzcr_attr() definition. Page 38 PCCTS 1.00 typedef union {int ival; float fval;} Attrib; zzcr_attr(attr, token, text) Attrib *attr; int token; char *text; { switch ( token ) { case INT : attr->ival = atoi(text); break; case FLOAT : attr->fval = atof(text); break; } } The typedef specifies that attributes are integer or floating point values. When the regular expression for a floating point number (which has been labeled FLOAT) is matched on the input stream, zzcr_attr() converts the string of characters representing that number to a C float. The next attribute example is more involved. Attributes are either symbol table entries or integer values. Variables (VAR), intrinsic types (BuiltInType) and parameters (PARAMETER) found on the input stream are converted to symbol table pointers. Integers are converted to C int's. /* Assume Sym, zzsymget() are defined elsewhere */ #header <<typedef union { Sym *var; int val; } Attrib;>> << Sym *cursym; main() { ANTLR(prog(), stdin); } Page 39 PCCTS 1.00 zzcr_attr(attr, token, text) Attrib *attr; int token; char *text; { switch ( token ) { case WORD : case BuiltInType : case PARAMETER : case VAR : attr->var = cursym; break; case INT : attr->val = atoi(text); break; } } >> #token WORD "[a-zA-Z_][a-zA-Z_]*" <<{ extern Sym *cursym; cursym = zzsymget(zzlextext); if ( cursym == NULL ) cursym = zzsymnew(zzlextext); else LA(1) = cursym->token; }>> prog : ... VAR ... ; When a WORD is matched on the input stream, the character string is looked up in the symbol table. If it exists, a pointer to it is placed in cursym and the current look-ahead token (LA(1)) is modified accordingly. Otherwise, a symbol table record is created leaving cur- sym pointing to it. Also of note, we cannot place the zzsymget() call into zzcr_attr() because of the look-ahead. zzcr_attr() is called after the token has been recognized; modifying LA(1) in zzcr_attr() would have no effect. Parsing decisions are made based upon current look-ahead and, therefore, the lexical action must "compute" the token value before the parser receives it. 4.1.2. Attribute destruction The user may elect to "destroy" all attributes created with Page 40 PCCTS 1.00 zzcr_attr(). A macro or function called zzd_attr(), is executed once for every attribute. This occurs collectively at the end of every block. zzd_attr() is passed the address of the attribute to destroy. This can be useful when memory is allocated with zzcr_attr() and needs to be free()'ed. For example, sometimes zzcr_attr() needs to make copies of some lexical objects temporarily. Rather than explicitly inserting code into the grammar to free these copies, zzd_attr() can be used to do it implicitly. Notice that this concept is similar to the constructors and destructors of C++ [Str87]. Attributes are often character strings. Copies of the lexical text buffer (managed by DLG) are made which later need to be deallo- cated. This can be accomplished with code similar to the following. #header <<#define zzd_attr(attr) {free(*(attr));} typedef char *Attrib;>> << zzcr_attr(attr, token, text) Attrib *attr; int token; char *text; { extern char *malloc(); if ( token == StringLiteral ) { *attr = malloc( strlen(text)+1 ); strcpy(*attr, text); } } >> 4.1.3. Standard attribute definitions Some typical attribute types are defined in the PCCTS include directory. These standard attribute types are contained in the fol- lowing include files: charbuf.h Attributes are fixed-size buffers, each of 32 characters in length. If a string longer than 31 characters (31 + 1 '\0' ter- minator) is matched for a token, it is truncated to 31 charac- ters. The user can change this buffer length from the default 32 by redefining ZZTEXTSIZE before the point where charbuf.h is included. The text for an attribute must be referenced as $i.text. Page 41 PCCTS 1.00 int.hAttributes are int values derived from tokens using the atoi() function. charptr.h, charptr.c Attributes are pointers to dynamically-allocated variable-length strings. Although generally both more efficient and more flexi- ble than charbuf.h, these attribute handlers use malloc() and free(), hence, they can cause memory fragmentation. The file charptr.c must be #include'd, or linked with, the C code PCCTS generates for any grammar using charptr.h. In addition to the standard library of attribute handlers, the user may define new attribute types and handlers - which should also be placed in the PCCTS include directory. The makefile template given in section 1.4 knows where the include files live and will inform the C compiler. The following example uses the charptr package: #header <<#include "charptr.h">> << #include "charptr.c" main() { ANTLR(a(), stdin); } >> #token "\n" <<zzskip();>> a : "[a-z]+" <<printf("matched %s\n", $1);>> ; 4.1.4. Attribute references PCCTS accepts five basic attribute referencing notations: $i This format refers to the ith rule element within the current alternative which can be a rule reference, token or subrule. $i.j This format extends the $i notation to allow the user to refer to attributes at scopes above the current scope. i is the scope and j is the jth rule element within the alternative that encloses the current scope. $$ This is a shorthand for the $0 of the enclosing rule and can be referred to from an action anywhere within the rule. $label $rule is identical to $$ where rule is the name of the enclosing Page 42 PCCTS 1.00 rule. $parameter refers to one of the user-defined parameters given in the enclosing rule definition. $return_value refers to one of the user-defined return values given in the enclosing rule definition. $[token, text] This represents an attribute constructor. Its value is computed using zzcr_attr just like any other attribute with token and text as parameters. This is an explicit method of attribute creation whereas normally attributes are created implicitly - one for each input terminal. This mechanism can be useful when creating AST nodes which are functions of attributes or for passing values to subrules. For example, if attributes were integers and integer tokens were labeled NUM, $[NUM,"34"] would create an attribute whose value was 34. $[] returns an empty attribute node. 4.1.4.1. $i references As a simple example of $i numbering in rules with alternatives, consider the following. a: B | C ; Rule a has 2 alternatives. $i refers to the ith rule element in the current block and within the same alternative. So, in rule a, both B and C are $1. Below, rule a has a subrule as the second rule element and is referenced as $2 after the entire subrule has been matched. Assume that charptr.h attributes are in effect. a : "ick" ("A" <<$0 = $1;>> | "B" <<$0 = $1;>>) "ugh" <<printf("$1,$2,$3 are %s,%s,%s\n", $1,$2,$3);>> ; In this case, one of the following will be printed depending on whether "A" or "B" was found Page 43 PCCTS 1.00 $1,$2,$3 are ick,A,ugh $1,$2,$3 are ick,B,ugh $2 is the attribute of the subrule ("A"|"B") and is the second rule element of rule a (outermost level). However, within the subrule, "A" and "B" are both $1. The initial value of $0 will be undefined unless it either inherits a value or is assigned a value by defining the macro zzdef0. This value can, of course, be altered within the subrule by using $0 on the left side of an assignment. $$ always refers to the $0 of the outermost scope - the rule scope (i.e $1.0). $$ can be referenced from any user action within any subrule. 4.1.4.2. $i.j references The EBNF subrules of PCCTS introduce scoping as in programming languages. To access attributes at scopes above (in an enclosing scope) from within a user action, $i.j variables are used where i is the scope and j follows the normal attribute numbering. Levels are numbered from 1 starting at the rule level. Each subrule increments the level. Actions at level i cannot access attributes at level i+1 or below just like in a programming language. Attributes of the form $i are accepted as a shorthand. When a level is not indicated, the current scope will be assumed. For example, a : A B (C (D E) F) G ; The following table summarizes levels and attribute numbers: Page 44 PCCTS 1.00 ____________________________________________________ ____________________________________________________| | A || $1.1 | $1 | rule a after A | |______||_____________|________|____________________| |______||_____________|________|____________________| | C || $2.1 | $1 | (C (D E) F) | |______||_____________|________|____________________| |______||_____________|________|____________________| | E || $3.2 | $2 | (D E) | |______||_____________|________|____________________| |______||_____________|________|____________________| | G || $1.4 | $4 | rule a after G | |______||_____________|________|____________________| In this example, actions at level 1 (rule level) cannot access attri- butes within any of the subrules. Actions before the subrules cannot access these lower attributes because they have not been created yet (the terminals have not been recognized). Actions following the subrules cannot access the lower attributes because they have all been collapsed into one attribute--namely, $3. 4.1.4.3. $label references Inheritance (parameters and return values) and rule-level $0 attributes are referred to by name. $rule is identical to $$, but is more informative. The following two examples perform the same func- tion, but the first uses upward-inheritance and and the second uses $rule notation. #header <<#include "int.h">> <<main() { ANTLR(start(), stdin); } >> start : <<int r;>> expr > [r] "\n" <<printf("result %d\n", r);>> ; expr > [int result] : "[0-9]+" <<$result = $1;>> ( "\+" "[0-9]+" <<$result += $2;>> )* ; Note that the next version does not require a parameter definition since all rules implicitly return an attribute ($$, $rule). Page 45 PCCTS 1.00 #header <<#include "int.h">> <<main() { ANTLR(start(), stdin); } >> start : expr "\n" <<printf("result %d\n", $1);>> ; expr : "[0-9]+" <<$expr = $1;>> ( "\+" "[0-9]+" <<$expr += $2;>> )* ; Similarly, $parameter refers to one of the user-defined parameters. 4.1.5. $[token, text] attribute constructor This attribute is not automatically destroyed via zzd_attr if zzd_attr is defined. If necessary, the user must explicitly destroy these attributes. For example, #header <<#include "charptr.h">> << #include "charptr.c" main() {ANTLR(a(), stdin);} >> a : <<Attrib b = $[0, "a string"];>> <<printf("b is %s\n", b); zzd_attr(&b); >> ; which would yield b is a string Note that the token number is irrelevant for charptr type attributes since its zzcr_attr is a function of the text only. Page 46 PCCTS 1.00 4.1.6. Inheritance Inheritance refers to the ability of a rule to obtain information about the context in which it was invoked. Because ANTLR generates top-down parsers, information can be carried down through each rule invocation as well as returned. This ability, dubbed downward- inheritance, provides significant power because a rule can decide which rule invoked it or gain some other context altering information. LR parser-generators create parsers of the bottom-up variety and will never be able to pass information to a subrule because recognition begins at the leaves of the parse-tree and works it way upwards to the root (start symbol). 4.1.6.1. $0 inheritance $0 is one mechanism by which attributes can be inherited from and returned to an invoking block. The inheritance mechanism uses the attribute stack to initialize the $0 value seen by a subrule. Subrules can only pass one parameter which is implicitly typed as an attribute ($0). The $0 of a subrule is undefined unless set by a user action or by passing an attribute to that subrule. For example, decl : TypeID Var <<$2.type = $1;>> ( "," Var <<$2.type = $0;>> )*[$1] ; where the [ ] is an InheritAction mentioned above in the grammar for a PCCTS rule. The value inside is passed to the $0 of the subrule. This syntax is also used to pass parameters to rules. See below for more details. In this case, we are passing the type of a declaration (represented by the attribute associated with TypeID - $1 in the outermost scope) to a subrule which accepts a list of variables separated by commas. This could alternatively be accomplished via decl : TypeID Var <<$2.type = $1;>> ( "," Var <<$2.type = $1.1;>> )* ; or with a local variable. Page 47 PCCTS 1.00 decl : <<Attrib type;>> TypeID <<type = $1;>> Var <<$2.type = type;>> ( "," Var <<$2.type = type;>> )* ; Unlike subrules, rules employ the C hardware stack and $0 is never automatically set by inheritance. Despite this, the inheritance of a value for a rule's $0 can be accomplished by: rule[Attrib dummy]: <<$0=$dummy;>> remainder_of_rule ; This uses the more powerful rule communication mechanism and an assignment to $0 to simulate the inheritance mechanism used for subrules. Rules have a much more flexible inheritance scheme which, combined with rule-local variables for subrule inheritance, make the YACC-like attributes (Attrib's) useful primarily for communicating with the lexical analyzer. If a macro called zzdef0 is #define'd, it is executed upon entry to any block (rule or subrule) via zzdef0(&($0)). This macro can be used to ensure that all attributes have a certain property. For exam- ple, if $0 is not set in a rule, then the $i in the invoking rule will be undefined (some previous attribute typically). Setting the $0 ini- tially to NULL or some other meaningful value can be convenient. If attributes are string pointers, not assigning $0 to NULL before rule completion could cause an invoking rule to assume a valid pointer had been returned (as $i). Note that use of zzdef0 is incompatible with the use of $0 in subrules to inherit values, i.e., zzdef0 will destroy the inherited value. 4.1.6.2. Generalized rule communication PCCTS rules may pass parameters and return values just as C func- tions do (though, PCCTS rules can return multiple values). This mechanism can be viewed as generalized attribute handling where each rule defines its downward- and upward-inheritance attribute types. 4.1.6.2.1. Downward-inheritance The syntax for PCCTS parameter passing (downward-inheritance) is Page 48 PCCTS 1.00 identical to that of the ANSI C Standard [ANS90] except that the ( ) is replaced by [ ]. Specifically, parameter declarations are comma- separated lists of single declarations: rule[type var , ..., type var ] : ... ; 1 1 n n The parameter declarations are broken down into old style C declara- tions when parsers are generated: rule(var , ..., var ) type va1 ; n 1..; 1 type var ; { n n /* Stuff to recognize rule */ } which is recognized by both ANSI compliant and old-style C compilers. The parameters mentioned in the parameter declaration list may be used in user actions like any other variable just as in C except that they must be preceded by a $. The $ is consistent with references to the attributes associated with terminals and rule references. For example, a[int i, int j] : <<printf("%d\n", $i+$j);>> ; adds the two inherited attributes i and j and prints the result to stdout. It can be invoked in the following manner: rule : a[3,4] ; Page 49 PCCTS 1.00 4.1.6.2.2. Upward-inheritance Return values are similarly declared via a comma-separated list of types and names enclosed in [ ] (following the parameter definition if any): rule[parameters] > [type var , ..., type var ] : ... ; 1 1 n n The names given in the return value definition may be used to expli- citly set the return values. These may be treated like $-prefixed variables and can referenced as well as set. The names in the return value definition may not conflict with any parameter or local variable defined elsewhere in that rule. a[int i, int j] > [int x, int y] : <<$x = $i+$j; $y = 0;>> ; Rule a can be invoked in the following manner: rule : <<int i,j;>> a[3,4] > [i,j] ; which passes 3 and 4 to rule a which returns 7 and 0 placing them into i and j respectively. 4.2. Abstract-syntax-tree construction and manipulation Many translation problems can be accomplished with actions embed- ded in a grammar that simply compute an output phrase and send it to the output stream. Unfortunately, most translation problems cannot be implemented in such a straightforward manner. For this reason, PCCTS supports an abstract-syntax-tree (AST) mechanism that provides a framework for constructing and manipulating intermediate representa- tions of the input stream. Without touching the grammar description, the user may direct PCCTS (by using the -gt command-line option) to automatically generate an AST of the input stream. By default, this AST is a flat tree (a list) with one node for each input token. Generally, the organization Page 50 PCCTS 1.00 of an AST reflects the language structure used to recognize an input phrase. A flat AST does not contain information about the language structure and, in effect, would only be a copy of the input. The default automatic tree construction can be modified by annotating the grammar with a few appropriately placed characters and is sufficient to handle many situations (such as arithmetic expressions) well. Some intermediate-forms cannot be adequately described by simple grammar annotations (e.g. type-trees for the C programming language). A more general tree construction mechanism is provided by PCCTS to allow explicit intermediate-form tree description and is described in section 4.2.4. AST nodes and subtrees, both automatically and explicitly con- structed, are referenced via #-variables. The notation and behavior of these variables is similar to that of the attribute $-variables. They are always implicitly typed as pointers to AST nodes and are gen- erally associated with rule and token references; rules yield subtrees and tokens yield nodes. PCCTS parsers manage the actual creation of nodes and the con- struction of the trees, but the user must describe what information an AST node is to carry and how to fill in that information. The command-line option -gt must be used to access any of the features is this section. Also note that the PCCTS automatically defines #define GENAST so that the user may write code that depends on whether or not trees are being constructed. 4.2.1. AST node creation PCCTS provides a default and an explicit node constructor. The default constructor is applied for each token in a rule defined without the ! modifier (which indicates that the rule should disable automatic tree construction) and is a function of the associated attribute, token and lexical text. The explicit constructor is user defined and can take a variable number of arguments. A macro called zzcr_ast() can be defined to give a default node definition or transformation. Also, user actions embedded in the grammar can modify AST fields. AST nodes have the structure struct _ast { struct _ast *right, *down; user_defined_fields }; where user_defined_fields is filled in via the #define constant: AST_FIELDS. Only the user-defined fields should be modified by the Page 51 PCCTS 1.00 user as right and down are handled by the code inserted by PCCTS and could be subject to change. zzcr_ast() is of the form #define zzcr_ast(ast,attr,tok,text) *ast = F(attr, tok, text); with AST *ast; Attrib *attr; int tok; char *text; For example, a simple, but useful attribute and AST definition is #header << typedef int Attrib; #define AST_FIELDS int i; #define zzcr_attr(attr, token, text) *(attr)=atoi(text) #define zzcr_ast(ast, attr, tok, text) (ast)->i = *(attr) >> which defines both attributes and AST node data to be integers. This default AST node construction merely copies the integer found on the input stream into the data field i of the new AST node. AST nodes may be explicitly constructed via a user-defined func- tion called zzmk_ast() which is defined as follows: Page 52 PCCTS 1.00 #include <varargs.h> /* fill in an AST node (which is passed in) and return a ptr to it */ AST *zzmk_ast(va_alist) va_dcl { va_list ap; AST *node; va_start(ap); node = va_arg(ap, AST *); /* 1st arg is always a new AST node */ ... /* fill in node */ va_end(ap); return node; } Note that the K&R-style variable argument method is not required if you are not interested in portability or a constant number of parame- ters is passed (if only one node type is used, for example). A port- able ANSI C version would look like the following: #include <stdarg.h> /* fill in an AST node (which is passed in) and return a ptr to it */ AST *zzmk_ast(AST *node, ...) { va_list ap; va_start(ap, node); ... /* fill in node */ va_end(ap); return node; } To make an explicit node constructor that duplicates the simple integer AST node definition given above, the following would suffice (ANSI C version): Page 53 PCCTS 1.00 #header << typedef int Attrib; #define AST_FIELDS int i; #define zzcr_attr(attr, token, text) *(attr)=atoi(text) #define zzcr_ast(ast, attr, tok, text) (ast)->i = *(attr) #include <stdarg.h> >> AST *zzmk_ast(AST *node, ...) { va_list ap; va_start(ap, node); node->i = va_arg(ap, int); va_end(ap); return node; } Now, an action may construct tree nodes via #[integer]. For example, a > [AST *node] : <<$node = #[34];>> /* return a node with 34 in it */ ; The #[ ] node constructor is detailed in section 4.2.6. 4.2.2. AST node destruction A routine called zzfree_ast(AST *t) is used to free any abstract-syntax-tree created by an PCCTS parser. If defined, a macro called zzd_ast is applied to each AST node before its memory is free()'d. This can be used to free any additional memory that may have been allocated by zzcr_ast and attached to the AST nodes. The zzd_ast macro is passed the address of the node it is to destroy and is of the form #define zzd_ast( treeNodePtr ) /* perform operation on tree node */ Page 54 PCCTS 1.00 zzd_ast must be defined in or included into the #header action. zzfree_ast() must be called explicitly by the user. 4.2.3. Automatic AST construction For small or uniform grammars (like C language expressions), PCCTS provides a concise tree construction mechanism. Token refer- ences are annotated to indicate o which tokens are to be subtree roots in the AST o which tokens are to be excluded from the AST Any non-modified (non-root and non-excluded) token is considered a leaf node. Rule names are never included in the automatically gen- erated AST's. The resulting tree is composed entirely of nodes derived from grammar terminals. Each rule yields one subtree. Subrules do not create subrules; elements enclosed within subrules modify the subtree for the current rule. Subtrees and nodes created from terminal references are accessed from user actions via #- variables which are discussed in section 4.2.6. 4.2.3.1. Grammar annotation PCCTS generates an AST by placing C code into the parser that, at parser run-time, builds a tree and makes it available to the user's program. AST's are stored in a child-sibling list form similar to that used by LISP. This allows each level in the AST to have one root and arbitrarily many children: (root child child ... child ) 1 2 n Suffixing a token with ^ identifies it as a subtree root. Each root-token encountered increases the depth of the current subtree by 1. All further leaf tokens are considered siblings to the previous root node. A ! suffix on a terminal reference excludes the terminal from the AST (no AST node is created or linked in). Suffixing a rule reference with a ! indicates that the subtree created by that rule reference is not to be linked into the current subtree, but is still to be made available via #i variables. Rule references yield a pointer to the root of the subtree created by that rule invocation. Therefore, the subtree roots returned by rules become siblings in the current rule. Rules defined with a ! immediately following the rule name (and before the :) do not automatically construct trees. A ! in the rule definition is like placing a ! after each grammar element within that rule. #-variables do not exist for tokens suffixed with !. Hence, tree nodes are not automatically constructed within these rules although references to other rules can still create subtrees that can be accessed via #-variables. Page 55 PCCTS 1.00 To demonstrate the actual construction of AST's, consider the follow- ing grammar fragment. x : A B^ C! D y ; y : E^ F ; The grammar matches A B C D E F and considers B to be a subtree root. Token C is to be ignored. Rule y matches E F and returns a subtree of depth 2 to rule x where the root of rule y's subtree becomes available as #5. The following fig- ure sequence shows the state of the AST after each token has been recognized. The various AST nodes are labeled with the #-variable that could be used to access that node. The LISP-like child-sibling notation is given after the input state. [Figure omitted] A o B C D E F, ( A ) [Figure omitted] A B o C D E F, ( B A ) [Figure omitted] A B C o D E F, ( B A ) [Figure omitted] Page 56 PCCTS 1.00 A B C D o E F, ( B A D ) [Figure omitted] A B C D E F o, ( B A D ( E F ) ) Because of the ! suffix, terminal C does not appear in the AST. Ter- minal B is a root-token because of the ^ modifier and is therefore the root of the subtree created for rule x. The root of the subtree created by rule y is a sibling of the leaf nodes of rule x. The way in which rules describe a grammar effects AST construction. e.g. rules x and y could be rewritten to recognize the same input, but to construct a different tree. x : A B^ C! D E^ y ; y : F ; In this case the root-token E will become the root of the tree matched up to that point (i.e. ( B A D )) yielding ( E ( B A D ) ) The final AST would be ( E ( B A D ) F ) 4.2.3.2. Operator precedence Operator associativity in the AST's is directly related to the precedence implicitly defined among the rules. For example, to specify that the exponentiation operator, **, groups right-to-left, a grammar similar to the following would be required. Page 57 PCCTS 1.00 expr : exp0 { "\*\*"^ expr } ; exp0 : "[0-9]+" ; The input 3**4**5 would yield ( \*\* 3 ( \*\* 4 5 ) ) indicating that the 4**5 would be performed first. For a full explanation of arbitrary tree construction see the section on explicit AST construction. However, the following example is beneficial in this section. Give a grammar fragment that recog- nizes a simple assignment, a tree of the form: ( := expr lvalue ) can be constructed via: assign! : lvalue ":=" expr ";" <<#0 = #(#[$2], #3, #1);>> ; The ! after the rule definition tells PCCTS to turn off automatic tree construction for this rule. #[$2] is an AST node constructed from an attribute. 4.2.3.3. Example The following simple grammar illustrates the automatic construc- tion of AST's, the definition/creation of AST nodes, and the use of #i variables Page 58 PCCTS 1.00 #header << typedef int Attrib; #define AST_FIELDS int i, token; #define zzcr_attr(attr, token, text) *(attr)=atoi(text) #define zzcr_ast(ast, attr, tok, text) {(ast)->i = *(attr); (ast)->token=0;} >> << AST *root = NULL; /* define a root ptr for AST; must be NULL */ void show(tree) /* define function to show a tree node */ AST *tree; { if ( tree->token != 0 ) printf(" %s", zztokens[tree->token]); else printf(" %d", tree->i); } void before() { printf(" ("); } void after() { printf(" )"); } main() { ANTLR(expr(&root), stdin); /* get an expr AST by passing &root */ zzpre_ast(root, show, before, after);/* print out in LISP notation */ } >> expr : exp0 ( "\+"^ <<#1->token=LA(1);>> exp0 )* "\n"! ; exp0 : exp1 ( "\*"^ <<#1->token=LA(1);>> exp1 )* ; exp1 : "[0-9]+" ; The actions in rules expr and exp0 set the token field (defined in AST_FIELDS) to the current token of look-ahead (LA(1)). This tells the predefined astpreorder() function to print the token value and not the number value (field i). Note that we must pass the address of a root pointer to expr so that expr can make it point to the subtree Page 59 PCCTS 1.00 constructed for that rule. The newline character is not included in the trees because of the ! suffix. The + and the * characters are considered operators in our language and so they are modified by ^. The job performed by the default AST node creation macro zzcr_ast could have just as easily been accomplished by removing zzcr_ast and changing rule exp1 to be exp1 : "[0-9]+" <<#1->i = $1; #1->token = 0;>> ; which is somewhat more explicit, but would be inconvenient if many such actions were required. If the input to this PCCTS parser example were 3+4, the program would print out ( \+ 3 4 ) which represents [Figure omitted] or, in child-sibling form [Figure omitted] Note that the AST is significantly different than the parse-tree which can be represented by [Figure omitted] An input of 3+4*5 yields ( \+ 3 ( \* 4 5 ) ) Page 60 PCCTS 1.00 which represents [Figure omitted] or, in child-sibling form [Figure omitted] The 4*5 would be performed before the addition because one of the addition's operands is itself an operation (i.e. a multiply). 4.2.4. Explicit AST construction The automatic tree construction scheme available with PCCTS is useful for building AST's whose structure is similar to that of the grammar parse trees. It is not suited to applications where the structure of the grammar bears no resemblance to the structure required in the intermediate-form. C declarations are a good example of this and a sample "C declaration to English" translator is developed in this section. Reasonable tree-construction and tree-rewriting systems, such as Purtilo and Callahan's [PuC89] NewYacc, have been developed but most rely on bottom-up parser generators like YACC to build a parse-tree in memory which is later traversed to simulate inherited attributes. PCCTS parsers construct AST's while parsing the input. Rules can pass subtrees as arguments to other rules and construct AST's without first having to build a parse-tree. PCCTS supports general AST construction via two simple tree description constructs - AST node constructors and AST tree definitions. The AST node constructor is similar to the constructor for attri- butes ($[token,text]) except that AST node constructors yield pointers to AST nodes whereas attribute constructors yield attributes (of type Attrib). #[args] creates an AST node and passes a pointer to it to zzmk_ast() with args as arguments. Constructors may be nested to create AST nodes from attributes; e.g. #[ $[token,text] ]. Trees are defined using a LISP-like notation because of its brev- ity. #( root, child , child , ..., child ) 1 2 n where root and child are pointers to AST nodes which can themselves be trees or node iconstructors. The return tree for any rule is #0 Page 61 PCCTS 1.00 which can appear on both the left- and right-hand side of an assign- ment statement. Automatic construction of trees will proceed unless the ! is included in the rule definition to override the normal AST mechanism. The same grammar may have some rules that use the automatic tree mechanism and other rules that explicitly construct trees. However, the methods must not be mixed within the same rule. In other words, do not assign a tree to #0 unless you have turned off the automatic tree construction. When default AST construction is turned off, #-variables do not exist for terminals or rule references except for references to rules that construct trees (explicitly or automatically). Nodes for terminals must be explicitly created just as the tree structure must be explicitly defined. 4.2.4.1. Simple example We present a trivial example in this section to illustrate the AST node and tree constructors. #header << #include "int.h" #define AST_FIELDS int i; >> Page 62 PCCTS 1.00 << AST *root = NULL; /* define a root ptr for AST; must be NULL */ void show(tree) /* define function to show a tree node */ AST *tree; { printf(" %d", tree->i); } void before() { printf(" ("); } void after() { printf(" )"); } AST *zzmk_ast(node, v) AST *node; int v; { node->i = v; return node; } main() { ANTLR(tree(&root), stdin); zzpre_ast(root, show, before, after); } >> tree! : NUM NUM NUM "\n" <<#0 = #( #[$[NUM,"101"]], #[$1], #[$2], #[$3] );>> ; #token "\ " <<zzskip();>> #token NUM "[0-9]+" Given the input 3 4 5, rule tree would return ( 101 3 4 5 ) The rule itself matches three integers followed by a newline. As Page 63 PCCTS 1.00 usual, the four attributes $1-$4 exist, but #1-#4 do not because rule tree was defined with a ! modifier. To create AST nodes for the ter- minals we use the node construct #[$i] which returns a new node con- structed through zzmk_ast. The newline is not to be included in the tree so no AST node was constructed for it. To illustrate the attri- bute constructor, a root node was created as a function of an attri- bute with a value of 101, though #[101] would have been easier. There is no corresponding terminal in the grammar for this AST node and therefore one was created. $[NUM,"101"] creates an attribute through zzcr_attr which is passed to zzmk_ast when enclosed in #[]. The sub- tree created in rule tree is explicitly returned by making #0 point to the root of the subtree. 4.2.4.2. C declaration to English translator The type declaration syntax of C is somewhat bizarre and can be troublesome to even experienced programmers. The syntax of declara- tion mirrors that of use. In other words, variables are defined the way they would be used in an expression. We give a loose description of how to correctly "parse" a C declaration and then we present a com- plete translator to describe, in English, what the declaration means. Kernighan and Ritchie [KeR88] give an excellent description of how to understand C declarations and they present a program to convert from C to a word description. To parse a declaration as a Human, we follow a few simple rules. (i) Start parsing at the variable name. (ii) Scan to the right. For each [expr] or () encountered say "n- element array of" or "function returning" respectively where n is the (optional) expression found inside the brackets. Continue scanning until a right-parenthesis or a semicolon. (iii)Now, scan left. For each * say "pointer to" until you see a type name or a left-parenthesis. (iv) Move to the next outer nesting level if one exists and begin again at rule (ii). If no more declaration symbols exist, look to the left and say the type name (ignoring storage classes here). For example, let us parse int *(*a[10])(); By (i), we begin at a. Looking to the right in (ii), we see brackets and so we say "10-element array of." The right-parenthesis makes us jump to rule (iii) and scan left. We see a * and say "pointer to." Page 64 PCCTS 1.00 The left-parenthesis finishes off (iii) and rule (iv) tells us to exit this level of parenthesis (nesting) and start again at (ii). Scanning to the right we see () and say "function returning." The semicolon forces us to (iii) where we see a * and say "pointer to." (iv) tells us to say "integer." Packing all of the "sayings" together yields: "10-element array of pointer to function returning pointer to integer" or, in a flat tree representation, ( [10] ( * ( () ( * int ) ) ) )." Our example parser will construct trees of a similar form but with result in a little better English. To build a type tree for C we "unroll" the nesting of the declaration so that it can be read left-to-right. Our trees will have the variable name at the root and the type as the leaf. Each node in the tree will be a type-qualifier. The type-qualifiers are one of [expr], (), * and are linked together in a parent/child relationship. e.g. ( i ( * int ) ) means that i is a pointer to an integer. The translator uses simple character buffers as attributes and AST node data. The trees are constructed at parse-time and later traversed to print out the description. A function to find the bottom of a parent/child list is required to augment the normal tree con- struction features of PCCTS. #header << #include "charbuf.h" #define AST_FIELDS Attrib attr; >> << AST *root = NULL; AST *zzmk_ast(node,attr) AST *node; Attrib attr; { node->attr = attr; return node; } main() { ANTLR(start(&root), stdin); } Page 65 PCCTS 1.00 AST *bottom(t) AST *t; { if ( t==NULL ) return NULL; if ( t->down != NULL ) return bottom(t->down); return t; } english(tree) AST *tree; { printf("%s is", tree->attr.text); engl(tree->down, 0); } engl(tree, plural) AST *tree; int plural; { if ( tree == NULL ) return; if ( strcmp(tree->attr.text, "*") == 0 ) printf(plural?" pointers to":" a pointer to"); else if ( strcmp(tree->attr.text, "nodim") == 0 ) { printf(plural?" arrays of":" an array of"); plural = 1; } else if ( strcmp(tree->attr.text, "int") == 0 ) printf(plural?" integers\n":" an integer\n"); else if ( strcmp(tree->attr.text, "ret") == 0 ) printf(plural?" functions returning": " a function returning"); else { if ( plural ) printf(" %s-element arrays of", tree->attr.text); else printf(" a %s-element array of", tree->attr.text); plural = 1; } engl(zzchild(tree), plural); engl(zzsibling(tree), plural); } >> Page 66 PCCTS 1.00 #token "[\ \t\n]+" <<zzskip();>> start! : ( d <<english(#1); zzfree_ast(#1);>> )* ; d! : <<AST *word;>> "int" d1[#[$1]] > [word] ";" <<#0 = #(word, #2);>> ; d1![AST *type] > [AST *word] : "\*" d1[#(#[$1], $type)] > [$word] <<#0 = #2;>> | d2[$type] > [$word] <<#0 = #1;>> ; d2![AST *type] > [AST *word] : WORD d3[$type] <<$word = #[$1]; #0 = #2;>> | "$" d1[NULL] > [$word] "$" d3[$type] <<#( bottom(#2), #4 ); #0 = (#2==NULL)? #4 : #2;>> ; d3![AST *type] : "\[" NUM "\]" d3[type] <<#0 = #( #[$2], #4 );>> | "\[" "\]" d3[type] <<#0 = #( #[$[WORD,"nodim"]], #3 );>> | "$" "$" <<#0 = #( #[$[WORD,"ret"]], type );>> | <<#0 = type;>> ; #token NUM "[0-9]+" #token WORD "[a-z]+" The rules of this grammar perform the following functions startMatch multiple declarations translating each one to English and freeing each tree. d Only integers are recognized for simplicity. Declarations begin with "int" and are terminated by a ";". Rule d1 returns a pointer to the AST containing the symbol name found in the declaration (which has not yet been added to the tree). Rule d returns the declaration found in rule d1 with the symbol name as the new root. The type node is always the leaf and is passed to rule d1 which builds everything on top of the type node. d1 The * is the lowest priority symbol in a declaration and hence we "look" to the right first. The * is appended to whatever is found to the right. This is accomplished by passing the * and the currently built tree to rule d2. Since we are scanning from the left, but must translate as if starting from the most deeply Page 67 PCCTS 1.00 nested structure, we must keep adding type-qualifiers on top of the root. The current tree is always passed down. Rule d1 returns a pointer to the AST node containing the variable name. d2 Recognize a variable name followed by an array/function descrip- tor or recognize a parenthesized declarator. Rule d2 creates an AST node for the variable name and returns it. The tree con- structed by rule d3 is also returned. We need to attach the array/function descriptor to the bottom of the parenthesized declarator which could be arbitrarily large. For this reason, we use the bottom() function to find the end of the declarator returned by the reference to d1. d3 Match an array or function descriptor. Note that two tokens of look-ahead are required to decide between alternatives one and two since both begin with a left-bracket. Tail-recursion is used to get a sequence of [] or () descriptors. Array descriptors like [34] are converted to tree nodes containing just the number of elements. If no size is present, "nodim" is substituted for the number in the AST node. If nothing is seen by rule d3, it returns the type that was passed in since nothing was appended. Function descriptors form type-qualifiers labeled "ret". It is interesting to note that the grammar actions required to con- struct the type trees were small and mostly assignments. The explicit tree constructors allow complicated trees to be built without a great deal of C code. The translator can be made using the following makefile. # Makefile for C -> English translator CFLAGS= -I../h AFLAGS= -gt -k 2 GRM=decl SRC=scan.c $(GRM).c err.c OBJ=scan.o $(GRM).o err.o test: $(OBJ) $(SRC) cc -o $(GRM) $(OBJ) $(GRM).c parser.dlg : $(GRM).g antlr $(AFLAGS) $(GRM).g scan.c : parser.dlg dlg -C2 parser.dlg scan.c Our translator knows how and when to make things plural so it gen- erates correct English. For example, Page 68 PCCTS 1.00 int *(*i[5])(); yields i is a 5-element array of pointers to functions returning pointers to integers 4.2.5. Tree manipulation routines PCCTS generates code that calls the following set of tree manipu- lation routines to create AST's. The user may modify the routines to suit their own purposes. The routines and AST node definitions are located in ast.c and ast.h respectively and reside in the PCCTS include directory. They are automatically included when the ANTLR -gt command-line option is specified. Routines prefixed with underscore are not normally called by the user. void _link(_root, _sibling, _tail) Called to ensure tree manipulation variables for current rule are valid after a rule reference. AST *_astnew() Create a new AST node and return it. void _child(_root, _sibling, _tail) Add a leaf node to the current AST. Creates an AST node, applies the default node transformation (zzcr_ast) if it exists, and appends the node to the current sibling list. After this func- tion, #i variables are available for user actions. void _subroot(_root, _sibling, _tail) Perform the same operations as _child(), but make the newly- created node the root for the current sibling list. If a root node exists, make the newly-created node the root of the current root. void zzpre_ast(tree, apply, before, after) Perform a preorder traversal of tree applying apply to each node. Apply function before before the start of each new subtree and function after after you complete each subtree. A common use of this function is to print out a tree. For exam- ple, if AST nodes had a field called token, the tree of tokens could be printed by defining the following Page 69 PCCTS 1.00 void show(tree) AST *tree; { if ( tree == NULL ) return; printf(" %s", zztokens[tree->token]); } void before() { printf(" ("); } void after() { printf(" )"); } and then calling zzpre_ast() as zzpre_ast(some_tree, show, before, after); AST *zzdup_ast(tree) Return a duplicate of the tree passed in. void zzfree_ast(tree) Free the AST pointed to by tree applying zzd_ast to each node before freeing if it is defined. zzrm_ast This macro is used to free the subtree for the current rule (using zzfree_ast()) and remove all references to it. PCCTS maintains local pointers to the current sibling list etc... which must be NULL'ed in order to begin a new subtree. Translators that perform an operation on a list of items using ()* or ()+ can add an action to free and remove references to the AST associated with the different items upon each iteration of the loop. e.g. (item <<zzrm_ast;>>)+. zzchild(tree) Return the first child of the node pointed to by tree. Returns NULL if tree is NULL. zzsibling(tree) Return the immediate sibling of the node pointed to by tree. Returns NULL if tree is NULL. The parameters tree, _root, _sibling and _tail are defined as Page 70 PCCTS 1.00 AST *tree, **_root, **_sibling, **_tail; A useful modification to _astnew() would be to use custom allocation of AST nodes. Since they are all the same size, it could be done quickly and efficiently. 4.2.6. AST references References to AST objects come in three basic forms. #i This attribute refers to the tree constructed for the ith rule element within the current alternative. #i variables are scoped/numbered just like $i variables and their association dis- solves after leaving the current attribute scope. #i is unique until the current block is exited or another iteration of a ( )* or a ( )+ loop has begun. #0 is special and always refers to the root of the subtree asso- ciated with the rule in which it was referenced. #0 is scoped like $$, not like $0; there is only one #0 for the entire rule. If #0 is ever set to NULL, the current subtree disappears (without freeing the nodes). Normally, #0 is only set when the automatic AST tree creation is turned off with a ! after the rule name in a definition. Since subrules do not have their own #0 values. The #i value of a subrule is undefined. However, for compatibility with $i numbering, subrules are counted as though they were assigned #i values: An example #-variable number is as follows. a : B ( C D /* #1 is C, #2 is D */ )* E /* #1 is B, #2 does not exist, #3 is E */ ; When the automatic tree construction is turned off for a rule, #-variables for terminals are undefined since nodes are not created for them. #-variables always exist for rule references since the referenced rule may construct a tree (implicitly or explicitly). Page 71 PCCTS 1.00 #[args] This form constructs an AST node and returns a pointer to it (of type AST *). The arguments and the newly-constructed node are passed to the zzmk_ast() function defined by the user. #(root, child , child , ..., child ) This for_ repres_nts a tree d_scriptor from a set of AST node pointers. It is translated to a function call that links all the child nodes together as siblings, with root as their parent. A pointer to the root is returned. If the root is NULL, a pointer to the sibling list is returned. #( NULL ) and #() both return NULL. #( nodeptr ) yields nodeptr. All child'ren become siblings of each other and any siblings of the child'ren that existed previously become siblings also. For example, #( NULL, #( NULL, #[a], #[b]), #[c] ) results in a, b and c all being siblings. 4.2.7. Invoking parsers that construct trees PCCTS parsers pass an address of a root pointer to other rules while constructing AST's. This root pointer is always the first parameter and will point to the subtree built for that rule upon rule completion. Any parameters defined for a rule are not affected by this additional rule argument. All root pointers must be set to NULL initially. Normally, PCCTS generates code to handle all pointer parameter passing and tree manipulations. However, the starting rule (i.e. the one called using the ANTLR macro) may try to reference the root pointer it presumes was passed in. Therefore, the address of a NULL-initialized root pointer must be manually passed in. e.g. AST *root = NULL; main() { ANTLR(start(&root), stdin); } start : stuff ; This root parameter will be not be defined in rule start if the -gt command-line option is not used. Page 72 PCCTS 1.00 Note that if main() is located in a file other than a grammar file, the user must explicitly include ast.h. 4.2.8. Error recovery and AST's When a syntax error is discovered by an ANTLR-generated parser, no further additions will be made to the subtree of the current rule. Additions will resume after "successful" error recovery. The AST is always left in a stable state (i.e. a valid child-sibling tree) so that user actions may examine and/or modify the tree. This also ensures that syntax errors will not result in ill-formed trees. 5. Error detection and repair Reasonable error recovery is a notoriously difficult task which is handled very poorly, or not at all, by most parser-generator sys- tems. In part, this may be due to the fact that most systems con- struct LR parsers, and these are theoretically less able to recover from errors than LL parsers [FiL88]. It can also be justified on the basis that a parser can be used to quickly find the first error, so that an interactive user can repair the error and recompile without having to read through a multitude of additional error messages which might represent the parser's confusion rather than actual errors. In contrast, PCCTS attempts to provide a simple, automatic, error reporting and recovery mechanism. Currently, ANTLR-generated parsers recover from syntax errors by consuming tokens until a token in the FOLLOW set of X is discovered where X is the rule in which the error was discovered. Errors are reported in terms of the set of tokens which could legally have appeared at that point. A syntax error occurs when a PCCTS parser discovers a token on the input stream that cannot be matched to any currently recognizable rule element. Because PCCTS performs grammar analysis on the user's grammar description, it is aware of all possible tokens that can be recognized at a given instant. Sets representing these tokens are passed to an error reporting function when a syntax error is discovered. Optionally, a fail-action can be executed. [Wir76] provides a simple mechanism for recovering from syntax errors in LL(1) parsers by consuming tokens until the current token is found to be a member of a "stopping set." This stopping set is comprised of the FOLLOW set for that rule and any user-defined marking ("resynch") tokens that are never to be skipped (e.g. BEGIN). PCCTS currently employs a simple variant of the [Wir76] mechanism. When a syntax error occurs in any production of rule X, the parser will con- sume input until the current token is a member of the stopping set for X. The syntax error recovery mechanism employed by PCCTS does not effect parser recognition speed unless an error is discovered. PCCTS error recovery is attractive because it is completely automatic, but it tends to delete groups of tokens which may lead to erroneous resyn- chronizations. This "glutton" scheme could be refined by allowing the Page 73 PCCTS 1.00 user to specify a set of resynch tokens to augment the FOLLOW set [Wir76]. These resynch tokens would be determined by experience. Sophisticated error recovery is possible using k tokens of look- ahead. Future versions of PCCTS might employ this large look-ahead for intelligent error recovery. Hence, it can be expected that automatic error handling will improve without requiring users to change the input to PCCTS. 5.1. Modifying grammars to recognize common syntax errors The user may modify a given grammar to recognize common, but erroneous sentences. For example, if rule expr8 below were part of an expression recognizer for a programming language, expr8 : VAR | INT | FUNC "$" elist "$" ; it could be modified to print an error message whenever an undefined variable was seen. In other words, expr8 : VAR | INT | FUNC "$" elist "$" | WORD <<error("Undefined variable: %s", $1);>> ; Without this additional production, the parser would print something like (assuming test was the undefined variable found): syntax error at "test" missing { VAR INT FUNC } The extra production would allow a more informative error message to be printed, such as: Page 74 PCCTS 1.00 Undefined variable: "test" 5.2. Default error reporting function The default error reporting function, zzsyn(), is defined as fol- lows. void zzsyn(char *text, /* text of current token */ int tok, /* current token */ char *egroup, /* label attached to rule with error if present */ unsigned *eset, /* etok==0 -> >1 token missing; set is eset */ int etok) /* etok>0 -> 1 token missing; token is etok */ { fprintf(stderr, "syntax error at \"%s\"", text); fprintf(stderr, " missing "); if ( !etok ) zzedecode(eset); else fprintf(stderr, "%s", zztokens[etok]); if ( strlen(egroup) > 0 ) fprintf(stderr, " in %s", egroup); fprintf(stderr, "\n"); } A list of the regular expression or label associated with each token value is stored in the char *zztokens[] array in err.c. zzedecode() is a predefined function that decodes the set of tokens stored in the bit-set eset. zzsyn() can be modified according to the user's needs. For exam- ple, it can be modified to print out the file and line number (zzline). 5.3. Error groups The user may attach a string to a non-terminal that is passed on to the error reporting macro/function zzsyn() by placing it just before the : in a rule definition. e.g. declarator "declaration or definition" : ... ; Page 75 PCCTS 1.00 Although zzsyn() can be redefined by the user, the default error reporting facility appends the specified string to the error message. More than one rule may share the same string. This mechanism can gen- erate decent error messages. For example, one could specify that all rules dealing with expressions (e.g. factor, term, conditional) are to be labeled as "expression". The default zzsyn() would then yield something like: syntax error at "NUM" missing { "\+" "\*" } in expression as opposed to syntax error at "NUM" missing { "\+" "\*" } Another example might be that all rules used to define variables, such as rule declarator above, should be labeled "declaration or definition". 5.4. Error token classes The default syntax error reporting facility generates a list of tokens that could be possibility matched when the erroneous token was encountered. Often, this list is large and means little to the user for anything but small grammars. For example, an expression recog- nizer might generate the following error message for an invalid expression, "a . b": syntax error at "." missing { "\+" "\-" "\*" "\/" "\&" "\|" "\<" "\>" "\=" ";" } A better error message would be syntax error at "." missing { operator ";" } This modification can be accomplished by defining the error class: Page 76 PCCTS 1.00 #errclass "operator" { "\+" "\-" "\*" "\/" "\&" "\|" "\<" "\>" "\=" } 5.4.1. Error class definitions The general syntax for #errclass is as follows: #errclass label { e e ... e } 1 2 n where label is either a quoted string or a label (capitalized just like token labels). The quoted string must not conflict with any rule name or regular expression. Groups of expressions will be replaced with this string and it must not appear to be a simple token. Simi- larly, an error class label may not share the same name as a labeled token. The error class elements, e , can be a labeled token, a regu- lar expression or a non-terminal. T_kens (labeled or regular expres- sions) referenced within an error class must at some point in the grammar be referenced in a rule or explicitly defined with #token. The definition need not appear before the #errclass definition. If a non-terminal is referenced, the FIRST set (set of all tokens that can be recognized first upon entering a rule) for that rule is instan- tiated into the error class. The -ge command-line option can be used to have PCCTS generate an error class for each non-terminal of the form: #errclass Rule { rule } which implies that each non-terminal will have an error class composed of the first set for that rule. The error class name is the same as the non-terminal except that the first character is converted to upper case. Error class names may also be used as error class elements (e ) since error class names are treated somewhat like tokens. This capa- bility allows error class hierarchies. For example, Page 77 PCCTS 1.00 #errclass Fruit { CHERRY APPLE } #errclass Meat { COW PIG } #errclass "stuff you can eat" { Fruit Meat } yum : (CHERRY | APPLE) PIE | (COW | PIG) FARM | THE (CHERRY | APPLE) TREE ; Different error messages will result depending upon where in rule a a syntax error is detected. If the input were THE PIG TREE the following error message would result: syntax error at "PIG" missing { Fruit } However, if the input were FARM COW the decent error message syntax error at "FARM" missing { "stuff you can eat" THE } would result. Note that without the error class definitions, the error message would have been: Page 78 PCCTS 1.00 syntax error at "FARM" missing { CHERRY APPLE COW PIG THE } which conveys the same information, but at a much lower level. 5.4.2. Error class utilization PCCTS attempts to construct sets of tokens for error reporting-- error sets. The sets are created wherever a parsing decision will be made in the generated parser. At every point in the parsing process there exists a set of currently recognizable/acceptable terminals. This set can be decoded and printed out when a syntax error is detected - when the current token is not currently recognizable. PCCTS attempts to replace subsets of all error sets with error classes defined by the user. For example, rule a below contains a subrule with more than one alternative which implies that a parsing decision will be required at run-time to determine which alternative to choose. a : (Happy | Sad | Funny | Carefree) Person ; If, upon entering rule a, the current token is not one of the four terminals found in the alternatives, a syntax error will have occurred and the following message would be generated (if "huh" were the current token): syntax error at "huh" missing { Happy Sad Funny Carefree } Let us define an error class called Adjective that groups those same four tokens together. #errclass Adjective { Happy Sad Funny Carefree } Now, the error message would be: Page 79 PCCTS 1.00 syntax error at "huh" missing { Adjective } PCCTS repeatedly trys to replace subsets of the error set until no more substitutions can be made. At each replacement iteration, the largest error class that is completely contained within the error set is substituted for that group of tokens. One replacement iteration may perform some substitution that makes another, previously inviable, substitution possible. This allows the hierarchy mechanism described above in the error class description section. The sequence of substi- tutions for the yum example would be: [i] { CHERRY APPLE COW PIG THE } [ii] { Fruit COW PIG THE } [iii] { Fruit Meat THE } [iv] { "stuff you can eat" THE } The error class mechanism leads to smaller error sets and can be used to provide more informative error messages. 6. Things you should know about This section is a collection of important things that did not properly belong in any other section. 6.1. Available program symbols This section describes all symbols visible to the user that may be of interest. 6.1.1. Macros, functions, #define's Page 80 PCCTS 1.00 ZZCOL /* define so that DLG tracks column # of tokens */ ZZLEXBUFSIZE /* if not set, default size of lexical bufs */ ZZA_STACKSIZE /* if not set, default size of attrib stack */ ZZAST_STACKSIZE /* if not set, default size of AST stack */ zzTRACE(rule) /* macro called at start of rule when -gd set */ LL_K /* == max # tokens of look-ahead used in parser */ ANTLR(start,stream) /* ANTLR startup macro; get input from stream */ ANTLRf(start,funcPtr) /* ANTLR startup macro; get input from func */ zzcr_attr() /* create attribute from token, text on input */ zzd_attr() /* call zzd_attr(&($i)) for each attrib created */ zzcr_ast() /* how to create an AST node from attribute */ zzmk_ast() /* how to create an AST node (user-defined) */ zzd_ast() /* how to destroy an AST node */ LA(i) /* ith token of look-ahead */ LATEXT(i) /* text for ith token of look-ahead */ zzlextext /* buffer holding text of current token */ zzsyn() /* error reporting macro */ zzpre_ast(tree, func1, func2, func3) /* see section 4.2.5 */ zzfree_ast(tree) /* see section 4.2.5 */ zzdup_ast(tree) /* see section 4.2.5 */ zzedecode(unsigned *error_set); /* decode a bit-set */ zzskip() /* Tell DLG to skip this token */ zzmore() /* Tell DLG to try for more chars */ zzmode(mode) /* switch DLG to another lex mode */ zzadvance() /* Get next char allow only valid chars */ zzrdstream(s) /* Reset lex_line, use stream s for input */ zzrdfunc(f) /* Reset lex_line, call function f to read input */ zzreplchar(c) /* replace current lex buffer with c */ zzreplstr(s) /* replace current lex buffer with s */ 6.1.2. Global variables char *zztokens[], /* zztokens[t] is rexpr or label for token t */ *zzbegexpr, /* start of text for currently recognized expr */ *zzendexpr; /* end of text for currently recognized expr */ int zzline, /* set by user to current line # */ zzbegcol, /* column # of 1st char in token */ zzendcol, /* column # of last char in token */ zzchar; /* current character of look-ahead */ void (*zzerr)(); /* ptr to func to exec upon lexical error */ One of the most common actions found among PCCTS descriptions is a statement similar to Page 81 PCCTS 1.00 #token "\n" << zzline++; >> /* Move to next line */ 6.2. ANSI versus K&R K&R, actually Kernighan and Ritchie, "The C Programming Language," Prentice-Hall, 1978, was effectively the standard for the C language until the ANSI X3-J11 committee put forth a definition of ANSI C [ANS90]. Unfortunately, ANSI C is not compatible with K&R C, yet both dialects are in common use. For this reason, PCCTS supports generation of either ANSI C or K&R C. By default, PCCTS generates K&R C code with parameterless extern declarations automatically placed in tokens.h for each parsing func- tion that has a return value. If the -ga command-line option is used (on both ANTLR and DLG), PCCTS generates ANSI C compatible code and function prototype declarations; all parsing functions with return values or parameter definitions have appropriate prototypes placed in tokens.h. According to the ANSI C standard, a compliant compiler must define the macro name __STDC__, which would not be defined by a K&R C compiler. Hence, code can be written to work correctly with both ANSI C and K&R C by using #ifdef to test if __STDC__ has been defined. Although the output of PCCTS would have been excessively large if #ifdef had been used to make the code be acceptable to both ANSI C and K&R C compilers, #include files and other support code may test for the macro __STDC__. 6.3. Bugs Occasionally, PCCTS will find a grammar that it cannot analyze and you will receive an error similar to the following. file.c, line 43: out of memory while analyzing alts 2 and 5 of (..)+ This is due to LL(k) grammar analysis requiring exponential space in some cases. Normally, this problem only will occur with k > 2 for large, ambiguous, grammars. A technique which avoids the worst-case exponential space requirement is being considered for a future release. Although not technically a "bug," the differences between B release PCCTS and version 1.00 will cause numerous errors to be Page 82 PCCTS 1.00 reported if old grammars are submitted to the new system. Most of these incompatibilities require only minor changes; see section 7.1. This was expected - why do you think we called it B? The code generated using the -ga option has not been tested for compliance with the ANSI C standard. The source code for ANTLR itself is not ANSI compliant, although it is fairly close. Error messages reporting ambiguities involving nullable subrules (i.e., {} or ()*) sometime blame the ambiguity on an alternative rather than the nullable construct. The names of PCCTS variables and/or functions are not necessarily consistent, mnemonic or convenient. PCCTS symbols are, therefore, subject to change. ANTLR will not compile on any system that does not have _iobuf as the standard FILE * structure name. This is because struct _iobuf * is used in a header file instead of FILE *. The header file is automatically generated via proto--the C example distributed with PCCTS. You can circumvent this problem by changing the declaration of struct _iobuf * in proto.h to FILE *. ANTLR normally checks to see that you do not reference an attri- bute (parameter, return value, normal attribute) that is not defined. However, if you use a $name where name is a substring of a valid parameter or return value, it thinks that it is ok. Invoking an ANTLR parser multiple times or invoking multiple parsers does not work very well; tokens tend to be lost. 7. History PCCTS Version 1.00 was written using B release PCCTS - 1.0B. 1.0B was, in turn, written in alpha release PCCTS, 1.0A, which was written in YUCC (a graduate class project gone wild). YUCC was writ- ten in PG; both of which were simple BNF-style recursive descent parser generators. PG was written in RD [JuD84]. 7.1. Differences from B release PCCTS PCCTS Version 1.00 is significantly different from B release PCCTS. However, Version 1.00 is mostly a superset - most old grammars require only minor changes for version 1.00. This section delineates all of the changes and a few possible migration paths. o #attrib is now called #header to reflect the more general use of the construct in version 1.00. o The ANTLR initiation macro now requires that the starting rule have () appended just as you would call the function by hand. For example, ANTLR(start(), stdin). Page 83 PCCTS 1.00 o A number of internal functions and variables have been renamed so that conflicts with user-defined names are less likely. All the new names begin with the prefix zz. For example: o aCreate() is now zzcr_attr(). o cur_char is now called zzchar, but the user should not need to reference it. o LexSkip and LexMore are now zzskip and zzmore. o Token is now called LA(1) (first token of look-ahead). LA(i) is the ith token. o LexText is now LATEXT(1) (text for first token of look-ahead). LATEXT(i) is the text for the ith token. o ANTLRi no longer exists. o Rules no longer automatically define an attribute parameter. Also, parameters of the form {...} (i.e. rule[{...}]) cannot be used. Basically, parameter passing to a rule is exactly like in a programming language. All parameters must be defined and can be any valid type. Also, return values may be defined for rules rather than using old-style upward attributes. Rules may simu- late the old-style $0 inheritance via rule[Attrib inh] : <<$0 = inh;>> ... ; o A new file called mode.h is created by DLG and is included into your ANTLR parser. This implies that DLG must be run before you can compile the C files created by ANTLR. o New command-line options have appeared, old ones have been changed and/or disappeared. To be precise, -T is now -gd. -z is gone. -n is now -gx. -l is now -fl. -c is now -gc. -I is gone. Page 84 PCCTS 1.00 -s is gone. -fi is gone. -fo is gone. -R is now -cr. -e is gone. -a is gone. -w is now one of -e1, -e2, -e3. -d is gone. -E is gone since the user can redefine zzsyn(). o zzreplchar() and zzreplstr() should be used instead of modifying the lexical buffer manually. o A pointer to a function called zzerr is set to a function to call upon lexical error (invalid token etc...). 7.2. Future directions Neither we nor Purdue University guarantee that PCCTS works or accept liability if it does not do what you want it to. However, we do try to maintain and improve the system. So, we value user feedback that might help us improve PCCTS. Within Purdue University's School of Electrical Engineering, we are constantly updating PCCTS. The 1.00 release does not include all the latest goodies - it contains only those features whose implementa- tion we believe to be useful, correct, and reasonably stable. The following is a partial list of improvements currently being con- sidered: Parser construction Currently, ANTLR parsers use a sequence of if statements to find the correct alternative to match. This is inefficient if only one token of look-ahead is required to uniquely determine which alternative to choose. Depending on the C compiler, a switch statement would generate much faster code. Attribute and AST variables are maintained on software stacks in 1.00. Grammar analysis could easily create a collection of local hardware-stack-based variables to use in place of the software stack. This would eliminate the need to specify an attribute or AST stack size and produce faster code. Inline rules For the most part, subrules are like macro-expansions of rules. Rather than referencing a rule, the user simply instantiates the Page 85 PCCTS 1.00 rule in place of the rule reference. Grammatically the two methods are the same, but they differ in two important ways. First, subrules cannot employ the sophisticated parameter passing mechanism available to rules. Second, subrules are much faster than rule references since subrules avoid the function call over- head. A future version of PCCTS might introduce inline rules. Inline rules would be defined as before (with the addition of the inline keyword) but would be instantiated into rules that refer- ence them. This would cut down on grammar clutter by breaking up long rules while maintaining the speed advantage of subrules. Abstract-syntax-trees Through experience with PCCTS AST's, a new group of tree manipu- lation routines should appear. Some of them could relate to code-generation. Example grammars More and more grammars are being built with PCCTS. Some of them might be made available through future PCCTS releases and through users' network postings. Recognition Currently, PCCTS parsers cannot recognize as large a class of languages as bottom-up parsers. Research is underway to create hybrid parsers that would retain top-down's programmer-interface features while gaining bottom-up's recognition strength. Code generation Research continues at Purdue toward a code-generator generator for parallel machines (PIG). A simple uni-processor code-generator that has been in use at Purdue may be made available in the next release of PCCTS. $tokenPCCTS only recognizes $i, $i.j, $rule variables presently. Future versions might allow $token variables which will refer to attributes for tokens by the name of the token not the position number. 7.3. Acknowledgements Thanks is due to the users of B release PCCTS for their excellent suggestions. In particular, Mark Nichols (PhD candidate in Electrical Engineering at Purdue) discovered many of the initial quirks and bugs. We acknowledge Dana Hoggatt (Micro Data Base Systems Inc) for his idea of error grouping (strings attached to non-terminals) and his signifi- cant software testing efforts. Thanks is due Professor Matthew O'Keefe and Peter Dahl (both at University of Minnesota) for their review of the user's manual and their bug-finding escapades into PCCTS. Page 86 PCCTS 1.00 8. References [ANS90] American National Standard for Information Systems, "Programming Language C," Document X3J11/90-013, February 1990. [AhU79] A.V. Aho, J.D. Ullman, Principles of Compiler Design, Addison- Wesley Publishing Company, Reading, Massachusetts, 1979. [DoP90] H. Dobler and K. Pirklbauer, "Coco-2 A New Compiler Compiler," ACM SIGPLAN Notices, Vol. 25, No. 5, May 1990. [Joh78] Stephen C. Johnson, "Yacc: Yet Another Compiler-Compiler," Bell Laboratories, Murray Hill, NJ, 1978. [JuD84] R. Juels, H. Dietz, S. Arbeeny, J. Patane, E. Tepe, "Automatic Translation Systems," Study Report, Center For Digital Systems, Department of Electrical Engineering and Computer Science, The Polytechnic Institute of New York, New York, 1984. [KeR88] Brian W. Kernighan and Dennis M. Ritchie, "The C programming Language," Prentice Hall Inc., Englewood Cliffs, New Jersey, 1988. [Les75] M. E. Lesk, "LEX--a Lexical Analyzer Generator," CSTR 39 Bell Laboratories, Murray Hill, NJ, 1975. [LeP81] Harry R. Lewis, Christos H. Papadimitriou, Elements of the Theory of Computation, Prentice Hall, Inc., Englewood Cliffs, New Jer- sey, 1981. [Nau63] P. Naur (ed.), "Revised report on the algorithmic language ALGOL 60," Communications of the ACM 6:1, 1-17. [PaD90] Terence Parr, Henry Dietz, Will Cohen, "Purdue Compiler- Construction Tool Set," Technical Report TR-EE 90-14, School Of Electrical Engineering, Purdue University, West Lafayette, IN, February 1990. [Par90] Terence Parr, "The Analysis of Extended BNF Grammars and the Con- struction of LL(1) Parsers," Technical Report TR-EE 90-30, School Of Electrical Engineering, Purdue University, West Lafayette, IN, May 1990. Page 87 PCCTS 1.00 [PuC89] James J. Purtilo and John R. Callahan, "Parse-Tree Annotations", Communications of the ACM, Vol. 32, No. 12, December 1989. [Str87] Bjarne Stroustrup, "The C++ Programming Language," Addison-Wesley Publishing Company, Reading, Massachusetts, 1987. [Wir76] Niklaus Wirth, Algorithms + Data Structures = Programs, Prentice Hall, Inc., Englewood Cliffs, New Jersey, 1976. Notes:[LeP81] credits P. Naur [Nau63] with Backus-Naur Form (BNF) and is a reasonable text on language theory. [AhU79] gave more com- plete references to Johnson's YACC and Lesk's LEX - i.e. the CSTR numbers from Bell Laboratories. PCCTS 1.00 Table of Contents 1. Introduction ................................................. 1 1.1. PCCTS programming interface ....................... 2 1.2. PCCTS information flow ............................ 3 1.3. PCCTS file(s) format .............................. 3 1.4. Makefile template ................................. 5 1.5. PCCTS component command-line options .............. 6 1.6. Introductory example .............................. 10 2. Lexical analysis and token definition ........................ 13 2.1. Token Labeling and token actions .................. 13 2.2. Lexical actions via #lexaction .................... 16 2.3. Multiple lexical classes .......................... 16 2.3.1. Multiple grammars, multiple lexical analyzers ....................................................... 17 2.3.2. Single grammar, multiple lexical analyzers ....................................................... 18 2.4. Handling end of input ............................. 18 2.5. Token order and lexical ambiguities ............... 19 2.6. Quoted tokens ..................................... 19 2.7. Interactive lexical analyzers ..................... 21 2.8. DLG lexical input ................................. 21 2.9. Tracking pattern position ......................... 22 3. Syntactic analysis ........................................... 22 3.1. PCCTS rule definitions ............................ 23 3.1.1. Subrules ............................... 24 PCCTS 1.00 3.1.2. Rule communication ..................... 25 3.1.3. Miscellaneous notes .................... 27 3.1.4. LL(k) parsing .......................... 28 3.2. Grammar ambiguities ............................... 28 3.3. Look-ahead size ................................... 30 3.4. ANTLR parser construction ......................... 30 3.4.1. Efficiency ............................. 31 3.4.2. Debugging parsers ...................... 32 3.5. PCCTS parser template ............................. 32 3.6. Grammatical actions ............................... 34 3.6.1. Init Actions ........................... 34 3.6.2. Fail actions ........................... 35 3.6.3. Actions appearing outside of rules ..... 36 4. Attribute handling ........................................... 37 4.1. General attribute handling ........................ 37 4.1.1. Attribute creation ..................... 38 4.1.2. Attribute destruction .................. 40 4.1.3. Standard attribute definitions ......... 41 4.1.4. Attribute references ................... 42 4.1.4.1. $i references ............... 43 4.1.4.2. $i.j references ............. 44 4.1.4.3. $label references ........... 45 4.1.5. $[token, text] attribute constructor 4.1.6. Inheritance ............................ 47 4.1.6.1. $0 inheritance .............. 47 4.1.6.2. Generalized rule communi- cation .......................................................... 48 4.1.6.2.1. Downward- PCCTS 1.00 inheritance ..................................................... 48 4.1.6.2.2. Upward- inheritance ..................................................... 50 4.2. Abstract-syntax-tree construction and manipula- tion ............................................................ 50 4.2.1. AST node creation ...................... 51 4.2.2. AST node destruction ................... 54 4.2.3. Automatic AST construction ............. 55 4.2.3.1. Grammar annotation .......... 55 4.2.3.2. Operator precedence ......... 57 4.2.3.3. Example ..................... 58 4.2.4. Explicit AST construction .............. 61 4.2.4.1. Simple example .............. 62 4.2.4.2. C declaration to English translator ...................................................... 64 4.2.5. Tree manipulation routines ............. 69 4.2.6. AST references ......................... 71 4.2.7. Invoking parsers that construct trees 4.2.8. Error recovery and AST's ............... 73 5. Error detection and repair ................................... 73 5.1. Modifying grammars to recognize common syntax errors .......................................................... 74 5.2. Default error reporting function .................. 75 5.3. Error groups ...................................... 75 5.4. Error token classes ............................... 76 5.4.1. Error class definitions ................ 77 5.4.2. Error class utilization ................ 79 6. Things you should know about ................................. 80 6.1. Available program symbols ......................... 80 PCCTS 1.00 6.1.1. Macros, functions, #define's ........... 80 6.1.2. Global variables ....................... 81 6.2. ANSI versus K&R ................................... 82 6.3. Bugs .............................................. 82 7. History ...................................................... 83 7.1. Differences from B release PCCTS .................. 83 7.2. Future directions ................................. 85 7.3. Acknowledgements .................................. 86 8. References ................................................... 87